jamescalam/youtube-transcriptions | ATYUN.COM official site: a full-service platform for AI tutorials and news

Dataset:

jamescalam/youtube-transcriptions

Tasks:

Conversational

Question answering

Text retrieval

Subtasks:

open-domain-qa extractive-qa document-retrieval

Languages:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

Other:

youtube technical speech to text speech+to+text

License:

afl-3.0

The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents roughly a sentence-length chunk of text alongside the video URL and timestamp.
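
Because each row pairs a text chunk with a video URL and a timestamp, one natural use is to link a chunk back to the moment it was spoken. The following is a minimal sketch, not part of the dataset itself. It assumes that `start` holds an offset in seconds and that the URL accepts YouTube's `t=<seconds>s` query parameter; the helper name `timestamped_url` and the sample row are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def timestamped_url(url: str, start_seconds: float) -> str:
    """Return `url` with a `t=<seconds>s` query parameter added or replaced.

    Assumption: `start_seconds` is an offset in seconds from the start of the video.
    """
    parts = urlparse(url)
    # keep every existing query parameter except a previous timestamp
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != 't']
    query.append(('t', f'{int(start_seconds)}s'))
    return urlunparse(parts._replace(query=urlencode(query)))


# hypothetical row, shaped like the fields described above
row = {'url': 'https://youtu.be/abc123', 'start': 42.5, 'text': 'example chunk'}
link = timestamped_url(row['url'], row['start'])
```

If the `start` column turns out to use a different unit or format, the integer conversion would need to be adapted accordingly.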

Note that each item in the dataset contains only a short chunk of text. For most use cases you will likely need to merge multiple rows into more substantial chunks of text. The following code snippet does that:

from datasets import load_dataset

# first download the dataset
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

new_data = []  # this will store adjusted data

window = 6  # number of sentences to combine
stride = 3  # number of sentences to 'stride' over, used to create overlap

for i in range(0, len(data), stride):
    # exclusive end of the window, clamped to the dataset length
    i_end = min(len(data), i + window)
    first, last = data[i], data[i_end - 1]
    if first['title'] != last['title']:
        # the window spans the end of one video and the start of the next, so skip it
        continue
    # create larger text chunk from rows i .. i_end-1
    text = ' '.join(data[i:i_end]['text'])
    # add to adjusted data list; 'end' comes from the last row actually included
    new_data.append({
        'start': first['start'],
        'end': last['end'],
        'title': first['title'],
        'text': text,
        'id': first['id'],
        'url': first['url'],
        'published': first['published']
    })
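
To check the windowing behaviour without downloading anything, the same idea can be sketched on plain Python dictionaries. This is an illustrative variant, not the dataset author's code. The function name `merge_rows` and the toy rows are hypothetical, and the sketch takes `end` from the last row included in each window.

```python
def merge_rows(rows, window=6, stride=3):
    """Merge consecutive rows into overlapping chunks of up to `window` rows.

    Windows that straddle two different video titles are skipped.
    """
    merged = []
    for i in range(0, len(rows), stride):
        chunk = rows[i:i + window]  # slicing clamps at the end of the list
        if chunk[0]['title'] != chunk[-1]['title']:
            continue  # window crosses a video boundary
        merged.append({
            'title': chunk[0]['title'],
            'start': chunk[0]['start'],
            'end': chunk[-1]['end'],
            'text': ' '.join(r['text'] for r in chunk),
        })
    return merged


# toy data: four rows from video 'A' followed by two rows from video 'B'
rows = [
    {'title': 'A', 'start': 0.0, 'end': 1.0, 'text': 'a0'},
    {'title': 'A', 'start': 1.0, 'end': 2.0, 'text': 'a1'},
    {'title': 'A', 'start': 2.0, 'end': 3.0, 'text': 'a2'},
    {'title': 'A', 'start': 3.0, 'end': 4.0, 'text': 'a3'},
    {'title': 'B', 'start': 0.0, 'end': 1.0, 'text': 'b0'},
    {'title': 'B', 'start': 1.0, 'end': 2.0, 'text': 'b1'},
]
chunks = merge_rows(rows, window=3, stride=2)
```

With `window=3` and `stride=2`, the window starting at the third row mixes videos 'A' and 'B' and is dropped, so the last row of 'A' appears in no chunk. A smaller stride gives more overlap and loses fewer rows near video boundaries, at the cost of more chunks.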

Author:

jamescalam

Dataset size:

76.09 MB