数据集:
c4
语言:
en计算机处理:
multilingual大小:
100M<n<1B语言创建人:
found批注创建人:
no-annotation源数据集:
original预印本库:
arxiv:1910.10683许可:
odc-byA colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: " https://commoncrawl.org" .
This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4
It comes in four variants:
The en.noblocklist variant is exactly the same as the en variant, except we turned off the so-called "badwords filter", which removes all documents that contain words from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words .
C4 is mainly intended to pretrain language models and word representations.
The dataset is in English.
An example form the en config is:
{ 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/', 'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': '2019-04-25T12:57:54Z' }
The data have several fields:
name | train | validation |
---|---|---|
en | 364868892 | 364608 |
en.noblocklist | 393391519 | 393226 |
en.noclean | ? | ? |
realnewslike | 13799838 | 13863 |
[More Information Needed]
C4 dataset is a collection of about 750GB of English-language text sourced from the public Common Crawl web scrape. It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish) in addition to extensive deduplication. You can find the code that has been used to build this dataset in c4.py by Tensorflow Datasets.
The dataset was explicitly designed to be English only: any page that was not given a probability of at least 99% of being English by langdetect was discarded.
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
@article{2019t5, author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu}, title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}, journal = {arXiv e-prints}, year = {2019}, archivePrefix = {arXiv}, eprint = {1910.10683}, }