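Datasets: c4
Languages: en
Multilinguality: multilingual
Size: 100M<n<1B
Language creators: found
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1910.10683
License: odc-by

A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org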
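This is the version prepared by AllenAI, hosted at https://huggingface.co/datasets/allenai/c4.
It comes in four variants: en, en.noblocklist, en.noclean, and realnewslike.
The en.noblocklist variant is exactly the same as the en variant, except that the so-called "badwords filter" is turned off. That filter removes every document containing a word from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.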
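To make the difference between en and en.noblocklist concrete, here is a minimal sketch of a document-level blocklist filter. The tokenization rule, the names `passes_blocklist` and `filter_documents`, and the placeholder word list are assumptions made for illustration; the dataset's actual build code may match words differently.

```python
import re

def passes_blocklist(text, blocklist):
    """Return True if no blocklisted word appears in the text."""
    # Assumption: compare lowercased word tokens against a lowercased blocklist.
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return tokens.isdisjoint(word.lower() for word in blocklist)

def filter_documents(docs, blocklist):
    """Keep only documents that contain no blocklisted word."""
    return [doc for doc in docs if passes_blocklist(doc["text"], blocklist)]

# Placeholder words stand in for the real list.
blocklist = {"foo", "bar"}
docs = [
    {"text": "A clean sentence about barbecue."},
    {"text": "This one mentions foo explicitly."},
]
kept = filter_documents(docs, blocklist)
```

Matching on whole tokens rather than substrings keeps a word such as "barbecue" from being dropped just because it contains a blocklisted string.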
C4 is mainly intended for pretraining language models and word representations.
The dataset is in English.
An example from the en config is:
{ 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/', 'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': '2019-04-25T12:57:54Z' }
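A record shaped like the example above can be validated with a few lines of standard-library Python. This is only a sketch: the helper name `parse_record` is hypothetical, and the timestamp format is inferred from the single example shown, so other records may need a more lenient parser.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("url", "text", "timestamp")

def parse_record(record):
    """Validate a C4-style record and parse its timestamp."""
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    # Assumption: timestamps look like '2019-04-25T12:57:54Z' (UTC).
    parsed = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return {**record, "timestamp": parsed.replace(tzinfo=timezone.utc)}

record = {
    "url": "https://klyq.com/beginners-bbq-class-taking-place-in-missoula/",
    "text": "Beginners BBQ Class Taking Place in Missoula!",  # shortened here
    "timestamp": "2019-04-25T12:57:54Z",
}
parsed = parse_record(record)
```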
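Each record has three string fields: url, text, and timestamp. The table below lists the number of examples in each split for every variant: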
name | train | validation |
---|---|---|
en | 364868892 | 364608 |
en.noblocklist | 393391519 | 393226 |
en.noclean | ? | ? |
realnewslike | 13799838 | 13863 |
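As a quick sanity check on the split sizes, the snippet below computes what share of each variant is held out for validation. It is a small arithmetic sketch; the helper name `validation_fraction` is introduced here for illustration only.

```python
# Split sizes copied from the table above; en.noclean is omitted because
# its counts are not listed.
SPLITS = {
    "en": (364868892, 364608),
    "en.noblocklist": (393391519, 393226),
    "realnewslike": (13799838, 13863),
}

def validation_fraction(train, validation):
    """Fraction of all examples that fall in the validation split."""
    return validation / (train + validation)

fractions = {name: validation_fraction(t, v) for name, (t, v) in SPLITS.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.4%} of examples are validation")
```

For all three variants the validation split is roughly 0.1% of the examples.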
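[More Information Needed]
The C4 dataset is a collection of about 750GB of English-language text sourced from the public Common Crawl web scrape. It applies heuristics to extract only natural language (as opposed to boilerplate and other gibberish), together with extensive deduplication. The code used to build this dataset is in c4.py, provided by Tensorflow Datasets.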
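The general idea of heuristic cleaning plus deduplication can be sketched as follows. Both rules here (keep only lines ending in terminal punctuation, drop lines already seen in an earlier page) are simplified stand-ins chosen for illustration; they are not a reproduction of the logic in c4.py, and the name `clean_page` is hypothetical.

```python
import hashlib

TERMINAL_PUNCTUATION = (".", "!", "?", '"')

def clean_page(text, seen_hashes):
    """Keep lines that end in terminal punctuation and were not seen before."""
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # treat as boilerplate (menus, headers, fragments)
        digest = hashlib.sha1(line.lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a line seen earlier
        seen_hashes.add(digest)
        kept.append(line)
    return "\n".join(kept)

seen = set()
page_a = "Home | About | Contact\nWe teach a BBQ class on Thursday.\nSign up now!"
page_b = "We teach a BBQ class on Thursday.\nSpectators attend for free."
cleaned = [clean_page(page, seen) for page in (page_a, page_b)]
```

Storing hashes instead of full lines keeps the memory needed for the seen set bounded per line, which matters once the input is far larger than these two toy pages.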
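The dataset was explicitly designed to be English only: any page that could not be identified as English with a probability of at least 99% was discarded.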
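The 99% rule amounts to thresholding a language detector's confidence. The sketch below shows that thresholding step only; `toy_english_probability` is a deliberately crude, hypothetical stand-in for a real language-identification model and should not be mistaken for the detector used to build the dataset.

```python
def keep_english(pages, english_probability, threshold=0.99):
    """Keep pages whose estimated probability of being English meets the threshold."""
    return [page for page in pages if english_probability(page) >= threshold]

def toy_english_probability(text):
    # Hypothetical stand-in detector: share of ASCII letters among all letters.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

pages = ["Do you want to get better at making BBQ?", "这是一个中文页面。", ""]
english_pages = keep_english(pages, toy_english_probability)
```

Passing the detector in as a function makes it easy to swap the toy scorer for a real model while keeping the threshold logic unchanged.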
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
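AllenAI is releasing this dataset under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use with respect to the content it contains.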
@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}