Dataset Card for YouTube Caption Corrections

Dataset Summary

This dataset is built from pairs of YouTube captions where both an auto-generated and a manually-corrected caption are available for a single specified language. It currently contains only English captions, but the scripts in the repo support other languages. It was motivated by errors seen in auto-generated captions at a recent virtual conference, with the hope that there could be some way to help correct those errors.

The dataset in the repo at https://github.com/2dot71mily/youtube_captions_corrections records, in a non-destructive manner, all the differences between an auto-generated and a manually-corrected caption for thousands of videos. The dataset here focuses on the subset of those differences that are mutual and span the same number of tokens in both captions, which means it excludes token insertions and deletions between the two captions. The dataset here therefore remains a non-destructive representation of the original auto-generated captions, but excludes some of the differences found in the manually-corrected captions.

Supported Tasks and Leaderboards

  • token-classification: The tokens in default_seq are from the auto-generated YouTube captions. If diff_type is greater than 0 at a given index, then the token at the same index in default_seq was found to differ from the token in the manually-corrected YouTube caption, and we therefore assume it is an error. A model can be trained to learn where the auto-generated captions contain errors.

  • slot-filling: The correction_seq is sparsely populated with tokens from the manually-corrected YouTube captions, at the locations where a token differs from the one in the auto-generated YouTube captions. The 'incorrect' tokens in default_seq can be masked wherever diff_type is greater than 0, so that a model can be trained to fill in a better word than the 'incorrect' one.

End to end, the models could first identify and then replace (with suitable alternatives) errors in YouTube and other auto-generated captions that lack manual corrections.
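The two tasks above can be sketched in a few lines of Python. This is a minimal illustration, not the author's training code: the helper names `error_labels` and `mask_errors` and the `[MASK]` placeholder are assumptions made for this example, and the short sequences are taken from the data instance shown in this card.

```python
from typing import List, Tuple

# Hypothetical placeholder; in practice use your tokenizer's own mask token.
MASK_TOKEN = "[MASK]"


def error_labels(diff_type: List[int]) -> List[int]:
    """Binary token-classification targets: 1 where diff_type > 0, else 0."""
    return [1 if d > 0 else 0 for d in diff_type]


def mask_errors(
    default_seq: List[str],
    correction_seq: List[str],
    diff_type: List[int],
    mask_token: str = MASK_TOKEN,
) -> Tuple[List[str], List[str]]:
    """Mask tokens flagged as different and return (masked input, fill targets)."""
    masked, targets = [], []
    for token, correction, d in zip(default_seq, correction_seq, diff_type):
        if d > 0:
            masked.append(mask_token)    # hide the presumed error
            targets.append(correction)   # the manually-corrected token
        else:
            masked.append(token)         # keep unchanged tokens as-is
            targets.append("")           # nothing to predict here
    return masked, targets


default_seq = ["you", "see", "it's", "a", "laughter", "but"]
correction_seq = ["", "see,", "", "", "draft,", ""]
diff_type = [0, 2, 0, 0, 7, 0]

labels = error_labels(diff_type)
masked, targets = mask_errors(default_seq, correction_seq, diff_type)
```

Here `labels` would serve as targets for an error detector, while `masked` and `targets` form an input/output pair for a fill-in model.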

Languages

English

Dataset Structure

Data Instances

If diff_type is greater than 0 at a given index, then the token at the same index in default_seq was found to differ from the token in the manually-corrected YouTube caption. The correction_seq is sparsely populated with tokens from the manually-corrected YouTube captions at those locations.

diff_type labels for tokens are as follows:

  • 0: No difference
  • 1: Case-based difference, e.g. 'hello' vs 'Hello'
  • 2: Punctuation difference, e.g. 'hello' vs 'hello,'
  • 3: Case and punctuation difference, e.g. 'hello' vs 'Hello,'
  • 4: Word difference with same stem, e.g. 'thank' vs 'thanked'
  • 5: Digit difference, e.g. '2' vs 'two'
  • 6: Intra-word punctuation difference, e.g. 'autogenerated' vs 'auto-generated'
  • 7: Unknown type of difference, e.g. 'laughter' vs 'draft'
  • 8: Reserved for unspecified difference
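When inspecting instances, it can help to map the integer labels to readable names. The sketch below assumes nothing beyond the label list above; the short names in `DIFF_TYPE_NAMES` and the helper `summarize_diff_types` are invented for this example and are not part of the dataset.

```python
from collections import Counter
from typing import Dict, List

# Short names are illustrative only; the dataset itself stores plain integers.
DIFF_TYPE_NAMES: Dict[int, str] = {
    0: "no_difference",
    1: "case",
    2: "punctuation",
    3: "case_and_punctuation",
    4: "same_stem",
    5: "digit",
    6: "intra_word_punctuation",
    7: "unknown",
    8: "reserved",
}


def summarize_diff_types(diff_type: List[int]) -> Dict[str, int]:
    """Count how often each diff_type label occurs in one sequence."""
    counts = Counter(diff_type)
    return {DIFF_TYPE_NAMES[label]: n for label, n in sorted(counts.items())}


# diff_type values copied from the data instance in this card.
diff_type = [0, 2, 0, 0, 7, 0, 0, 0, 0, 0, 7, 2, 0, 0, 2, 1, 0, 0, 0, 0]
summary = summarize_diff_types(diff_type)
```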

{
  'video_ids': '_QUEXsHfsA0',
  'default_seq': ['you', 'see', "it's", 'a', 'laughter', 'but', 'by', 'the', 'time', 'you', 'see', 'this', 'it', "won't", 'be', 'so', 'we', 'have', 'a', 'big'],
  'correction_seq': ['', 'see,', '', '', 'draft,', '', '', '', '', '', 'read', 'this,', '', '', 'be.', 'So', '', '', '', ''],
  'diff_type': [0, 2, 0, 0, 7, 0, 0, 0, 0, 0, 7, 2, 0, 0, 2, 1, 0, 0, 0, 0]
}
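Because the three sequences are aligned index by index, the corrected text for the instance above can be rebuilt by substituting each correction where diff_type is greater than 0. The function name `apply_corrections` is an assumption for this sketch. Token insertions and deletions are excluded from this dataset, so the result may not match the full manually-corrected caption.

```python
from typing import List


def apply_corrections(
    default_seq: List[str], correction_seq: List[str], diff_type: List[int]
) -> List[str]:
    """Replace each auto-generated token with its correction where diff_type > 0."""
    if not (len(default_seq) == len(correction_seq) == len(diff_type)):
        raise ValueError("all three sequences must have the same length")
    return [
        correction if d > 0 else token
        for token, correction, d in zip(default_seq, correction_seq, diff_type)
    ]


default_seq = ["you", "see", "it's", "a", "laughter", "but", "by", "the", "time", "you",
               "see", "this", "it", "won't", "be", "so", "we", "have", "a", "big"]
correction_seq = ["", "see,", "", "", "draft,", "", "", "", "", "",
                  "read", "this,", "", "", "be.", "So", "", "", "", ""]
diff_type = [0, 2, 0, 0, 7, 0, 0, 0, 0, 0, 7, 2, 0, 0, 2, 1, 0, 0, 0, 0]

corrected = apply_corrections(default_seq, correction_seq, diff_type)
corrected_text = " ".join(corrected)
# corrected_text == "you see, it's a draft, but by the time you read this, it won't be. So we have a big"
```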

Data Fields

  • 'video_ids': Unique ID used by YouTube for each video. Paste it into https://www.youtube.com/watch?v=<video_ids> to view the video
  • 'default_seq': Tokenized auto-generated YouTube captions for the video
  • 'correction_seq': Tokenized manually-corrected YouTube captions, populated only at those locations where there is a difference between the auto-generated and manually-corrected captions
  • 'diff_type': An integer label per token, greater than 0 at every token where there is a difference between the auto-generated and manually-corrected captions
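As a small illustration of the 'video_ids' field, the watch URL can be built from an ID with the standard library. The helper name `watch_url` is hypothetical, and the ID is the one from the data instance in this card.

```python
from urllib.parse import urlencode


def watch_url(video_id: str) -> str:
    """Build the YouTube watch URL for a given video ID."""
    return "https://www.youtube.com/watch?" + urlencode({"v": video_id})


url = watch_url("_QUEXsHfsA0")
```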

Data Splits

No data splits

Dataset Creation

Curation Rationale

It was created after viewing errors in auto-generated captions at a recent virtual conference, with the hope that there could be some way to help correct those errors.

Source Data

Initial Data Collection and Normalization

All captions are requested via googleapiclient and youtube_transcript_api at the channel_id and language granularity, using scripts at https://github.com/2dot71mily/youtube_captions_corrections.

The captions are tokenized on spaces, and the manually-corrected sequence has been reduced here to include only its differences from the auto-generated sequence.

Who are the source language producers?

Auto-generated captions are from YouTube, and the manually-corrected captions are from creators and any support they may have (e.g. community or software support).

Annotations

Annotation process

Scripts in the repo at https://github.com/2dot71mily/youtube_captions_corrections take a diff of the two captions and use it to create the annotations.
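The general idea of diff-based annotation can be sketched with Python's standard difflib. This is not the repo's implementation, which may differ substantially: the function name `annotate_replacements` is an assumption, only same-length replacements are kept (mirroring the exclusion of insertions and deletions described in the summary), and differences are flagged with a plain 0/1 rather than the finer-grained diff_type categories.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def annotate_replacements(
    auto_caption: str, manual_caption: str
) -> Tuple[List[str], List[str], List[int]]:
    """Diff two captions and keep only same-length token replacements.

    Returns (auto tokens, sparse corrections, 0/1 difference flags).
    """
    auto_tokens = auto_caption.split()      # tokenize on whitespace
    manual_tokens = manual_caption.split()
    correction_seq = [""] * len(auto_tokens)
    is_diff = [0] * len(auto_tokens)

    matcher = SequenceMatcher(a=auto_tokens, b=manual_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip equal spans, insertions, deletions, and unequal-length replacements.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for offset in range(i2 - i1):
                correction_seq[i1 + offset] = manual_tokens[j1 + offset]
                is_diff[i1 + offset] = 1
    return auto_tokens, correction_seq, is_diff


tokens, corrections, flags = annotate_replacements(
    "you see it's a laughter but", "you see, it's a draft, but"
)

# An inserted word ("so") produces no annotation in this sketch.
_, ins_corrections, ins_flags = annotate_replacements(
    "thank you very much", "thank you so very much"
)
```

Keeping only equal-length replacements preserves the one-to-one alignment between the auto-generated tokens and their corrections, at the cost of dropping some manual edits.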

Who are the annotators?

YouTube creators, and any support they may have (e.g. community or software support)

Personal and Sensitive Information

All content is publicly available on YouTube.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Emily McMilin

Licensing Information

MIT License

Citation Information

https://github.com/2dot71mily/youtube_captions_corrections

Contributions

Thanks to @2dot71mily for adding this dataset.