Dataset: youtube_caption_corrections
Sub-tasks: slot-filling
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: machine-generated
Source datasets: original
License: mit

This dataset is built from pairs of YouTube captions where both an auto-generated and a manually-corrected caption are available for a single specified language. It is currently available only in English, but the scripts in the repo support other languages. The motivation for creating it came from viewing errors in auto-generated captions at a recent virtual conference, with the hope that there could be some way to help correct those errors.
The dataset in the repo at https://github.com/2dot71mily/youtube_captions_corrections records, in a non-destructive manner, all the differences between an auto-generated and a manually-corrected caption for thousands of videos. The dataset here focuses on the subset of those differences that are mutual and span the same number of tokens in both captions, which means it excludes token insertions and deletions between the two captions. Therefore the dataset here remains a non-destructive representation of the original auto-generated captions, but excludes some of the differences that are found in the manually-corrected captions.
token-classification: The tokens in default_seq are from the auto-generated YouTube captions. If diff_type is greater than 0 at a given index, then the token at the same index in default_seq was found to differ from the token in the manually-corrected YouTube caption, and we therefore assume it is an error. A model can be trained to detect such errors in the auto-generated captions.
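As a minimal sketch of how the token-classification targets could be derived, the snippet below collapses diff_type into binary error labels. The function name to_error_labels is hypothetical and not part of the dataset or the repo; only the field names default_seq and diff_type come from the dataset.

```python
from typing import Dict, List


def to_error_labels(example: Dict[str, list]) -> List[int]:
    """Return 1 where diff_type marks a difference (> 0), else 0.

    Hypothetical helper: the binary collapse is an assumption about how
    one might set up the task, not the author's documented method.
    """
    return [1 if d > 0 else 0 for d in example["diff_type"]]


# A short excerpt shaped like a dataset instance.
example = {
    "default_seq": ["you", "see", "it's", "a", "laughter"],
    "diff_type": [0, 2, 0, 0, 7],
}
labels = to_error_labels(example)
print(list(zip(example["default_seq"], labels)))
```

Keeping the original diff_type values instead of collapsing them would give a multi-class variant of the same task.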
slot-filling: The correction_seq is sparsely populated with tokens from the manually-corrected YouTube captions at the locations where they were found to differ from the tokens in the auto-generated YouTube captions. The 'incorrect' tokens in default_seq can be masked wherever diff_type is greater than 0, so that a model can be trained to fill in a better word than the 'incorrect' one.
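The masking step for slot-filling might look like the sketch below. The helper mask_errors and the "[MASK]" placeholder are assumptions chosen for illustration; the right mask token depends on the model and tokenizer being used.

```python
from typing import Dict, List, Tuple


def mask_errors(
    default_seq: List[str],
    correction_seq: List[str],
    diff_type: List[int],
    mask_token: str = "[MASK]",  # assumed placeholder, model dependent
) -> Tuple[List[str], Dict[int, str]]:
    """Mask tokens flagged by diff_type and collect the corrected targets."""
    masked: List[str] = []
    targets: Dict[int, str] = {}
    for i, (tok, corr, d) in enumerate(zip(default_seq, correction_seq, diff_type)):
        if d > 0:
            masked.append(mask_token)
            targets[i] = corr  # token from the manually-corrected caption
        else:
            masked.append(tok)
    return masked, targets


masked, targets = mask_errors(
    ["it's", "a", "laughter", "but"],
    ["", "", "draft,", ""],
    [0, 0, 7, 0],
)
print(masked, targets)
```

Returning the targets keyed by index keeps the sparse structure of correction_seq while dropping the empty strings.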
End to end, such models could first identify and then replace (with suitable alternatives) errors in YouTube and other auto-generated captions that lack manual corrections.
English
If diff_type is greater than 0 at a given index, then the token at the same index in default_seq was found to differ from the token in the manually-corrected YouTube caption. The correction_seq is sparsely populated with tokens from the manually-corrected YouTube captions at those locations.
diff_type labels for tokens are as follows:
0: No difference
1: Case based difference, e.g. hello vs Hello
2: Punctuation difference, e.g. hello vs hello,
3: Case and punctuation difference, e.g. hello vs Hello,
4: Word difference with same stem, e.g. thank vs thanked
5: Digit difference, e.g. 2 vs two
6: Intra-word punctuation difference, e.g. autogenerated vs auto-generated
7: Unknown type of difference, e.g. laughter vs draft
8: Reserved for unspecified difference
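For inspecting how the labels are distributed, one possible approach is to map the integer codes to names and count them. The DiffType enum and its member names below are hypothetical shorthand for the descriptions above; the dataset itself only stores the integers.

```python
from collections import Counter
from enum import IntEnum


class DiffType(IntEnum):
    """Hypothetical names for the documented diff_type codes."""
    NO_DIFF = 0
    CASE = 1
    PUNCTUATION = 2
    CASE_AND_PUNCTUATION = 3
    SAME_STEM = 4
    DIGIT = 5
    INTRA_WORD_PUNCTUATION = 6
    UNKNOWN = 7
    UNSPECIFIED = 8


def count_diff_types(diff_type):
    """Count occurrences of each label, keyed by its readable name."""
    return Counter(DiffType(d).name for d in diff_type)


# diff_type values taken from the example instance in this card.
diff_type = [0, 2, 0, 0, 7, 0, 0, 0, 0, 0, 7, 2, 0, 0, 2, 1, 0, 0, 0, 0]
counts = count_diff_types(diff_type)
print(counts)
```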
{
  'video_titles': '_QUEXsHfsA0',
  'default_seq': ['you', 'see', "it's", 'a', 'laughter', 'but', 'by', 'the', 'time', 'you', 'see', 'this', 'it', "won't", 'be', 'so', 'we', 'have', 'a', 'big'],
  'correction_seq': ['', 'see,', '', '', 'draft,', '', '', '', '', '', 'read', 'this,', '', '', 'be.', 'So', '', '', '', ''],
  'diff_type': [0, 2, 0, 0, 7, 0, 0, 0, 0, 0, 7, 2, 0, 0, 2, 1, 0, 0, 0, 0]
}
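To see how the three sequences fit together, the sketch below overlays correction_seq on default_seq wherever diff_type is greater than 0. The function apply_corrections is a hypothetical helper. Because the dataset excludes token insertions and deletions, the result is only a partial reconstruction of the manually-corrected caption.

```python
from typing import List


def apply_corrections(
    default_seq: List[str], correction_seq: List[str], diff_type: List[int]
) -> List[str]:
    """Replace each flagged auto-generated token with its correction."""
    return [
        corr if d > 0 else tok
        for tok, corr, d in zip(default_seq, correction_seq, diff_type)
    ]


default_seq = ["you", "see", "it's", "a", "laughter", "but", "by", "the", "time",
               "you", "see", "this", "it", "won't", "be", "so", "we", "have", "a", "big"]
correction_seq = ["", "see,", "", "", "draft,", "", "", "", "", "",
                  "read", "this,", "", "", "be.", "So", "", "", "", ""]
diff_type = [0, 2, 0, 0, 7, 0, 0, 0, 0, 0, 7, 2, 0, 0, 2, 1, 0, 0, 0, 0]

corrected = " ".join(apply_corrections(default_seq, correction_seq, diff_type))
print(corrected)
# → you see, it's a draft, but by the time you read this, it won't be. So we have a big
```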
No data splits
It was created after viewing errors in auto-generated captions at a recent virtual conference, with the hope that there could be some way to help correct those errors.
All captions are requested via googleapiclient and youtube_transcript_api at the channel_id and language granularity, using scripts written at https://github.com/2dot71mily/youtube_captions_corrections .
The captions are tokenized on spaces, and the manually-corrected sequence has been reduced here to include only the tokens that differ from the auto-generated sequence.
Who are the source language producers?
Auto-generated scripts are from YouTube, and the manually-corrected scripts are from creators and any support they may have (e.g. community or software support).
Scripts in the repo at https://github.com/2dot71mily/youtube_captions_corrections take a diff of the two captions and use it to create the annotations.
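The repo's actual diff logic is not described here, so the following is only an illustrative stand-in that uses Python's standard difflib. It keeps replacements that span the same number of tokens on both sides and ignores insertions and deletions, which matches the subset described for this dataset. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def aligned_substitutions(
    auto_tokens: List[str], manual_tokens: List[str]
) -> List[Tuple[int, str, str]]:
    """Return (index_in_auto, auto_token, manual_token) for equal-length replacements."""
    matcher = SequenceMatcher(a=auto_tokens, b=manual_tokens, autojunk=False)
    subs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only replacements covering the same number of tokens on both sides.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for k in range(i2 - i1):
                subs.append((i1 + k, auto_tokens[i1 + k], manual_tokens[j1 + k]))
    return subs


def build_correction_seq(auto_tokens: List[str], manual_tokens: List[str]) -> List[str]:
    """Sparse sequence aligned to auto_tokens; empty string where nothing differs."""
    correction_seq = [""] * len(auto_tokens)
    for i, _, manual_tok in aligned_substitutions(auto_tokens, manual_tokens):
        correction_seq[i] = manual_tok
    return correction_seq


auto = "you see it's a laughter but".split(" ")
manual = "you see, it's a draft, but really".split(" ")
correction_seq = build_correction_seq(auto, manual)
print(correction_seq)
```

In this example the trailing token "really" is an insertion in the manual caption, so it does not appear in correction_seq.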
Who are the annotators?
YouTube creators, and any support they may have (e.g. community or software support).
All content publicly available on YouTube
Emily McMilin
MIT License
https://github.com/2dot71mily/youtube_captions_corrections
Thanks to @2dot71mily for adding this dataset.