Dataset: jamescalam/youtube-transcriptions
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License: afl-3.0

The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents roughly a sentence-length chunk of text alongside the video URL and timestamp.
Note that each item in the dataset contains just a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantial chunks of text. If you need to do that, this code snippet will help:
from datasets import load_dataset

# first download the dataset
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

new_data = []  # this will store adjusted data

window = 6  # number of sentences to combine
stride = 3  # number of sentences to 'stride' over, used to create overlap

for i in range(0, len(data), stride):
    # exclusive end of the slice, clipped to the dataset length
    i_end = min(len(data), i + window)
    # index of the last row actually included in this chunk
    last = i_end - 1
    if data[i]['title'] != data[last]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    # create larger text chunk
    text = ' '.join(data[i:i_end]['text'])
    # add to adjusted data list
    new_data.append({
        'start': data[i]['start'],
        'end': data[last]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published']
    })
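To try the windowing logic without downloading anything, the same idea can be sketched as a small function over plain Python dictionaries. This is only an illustrative sketch: `merge_rows` is a hypothetical helper name, the toy rows and their values are made up, and it assumes the end timestamp and the title check should use the last row actually included in each chunk.

```python
from typing import Dict, List


def merge_rows(rows: List[Dict], window: int = 6, stride: int = 3) -> List[Dict]:
    """Combine consecutive rows into overlapping chunks of up to `window` rows."""
    merged = []
    for i in range(0, len(rows), stride):
        chunk = rows[i:i + window]  # slicing clips automatically at the end
        first, last = chunk[0], chunk[-1]
        if first['title'] != last['title']:
            # the chunk spans the boundary between two videos, so skip it
            continue
        merged.append({
            'start': first['start'],
            'end': last['end'],
            'title': first['title'],
            'text': ' '.join(row['text'] for row in chunk),
            'id': first['id'],
            'url': first['url'],
            'published': first['published'],
        })
    return merged


# Hypothetical toy rows (not real dataset content): two videos, four rows each
rows = [
    {
        'start': float(j),
        'end': float(j + 1),
        'title': f'video-{v}',
        'text': f's{v}{j}',
        'id': f'{v}-t{j}',
        'url': f'https://example.com/{v}',
        'published': '2022-01-01',
    }
    for v in ('a', 'b')
    for j in range(4)
]

chunks = merge_rows(rows, window=3, stride=2)
for chunk in chunks:
    print(chunk['title'], chunk['start'], chunk['end'], chunk['text'])
```

Because any chunk that straddles two videos is skipped, a few rows near a video boundary may not appear in any merged chunk. A smaller stride reduces how often that happens, at the cost of more overlap between chunks.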