Dataset: yhavinga/mc4_nl_cleaned
Tasks: Text Generation
Sub-tasks: language-modeling
Language creators: found
Annotations creators: no-annotation
Source datasets: extended
ArXiv: arxiv:1910.10683
License: odc-by

A cleaned version (151GB) of the Dutch part (277GB) of the C4 multilingual dataset (mC4). While this dataset is monolingual, it is possible to download en-nl interleaved data; see the Dataset Config section below. The dataset is based on the Common Crawl dataset. The original version was prepared by AllenAI and is hosted at https://huggingface.co/datasets/allenai/c4.
The Dutch portion of mC4 was cleaned in a similar fashion to the English cleaned C4 version. See GitLab for details.
In summary, the preprocessing procedure includes:
- Removing documents containing words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words.
- Removing sentences containing:
  - Fewer than 3 words.
  - A word longer than 250 characters.
  - An end symbol not matching end-of-sentence punctuation.
  - Strings associated with JavaScript code (e.g. { ), lorem ipsum, or policy information in Dutch or English.
- Removing documents (after sentence filtering):
  - Containing fewer than 5 sentences.
  - Containing fewer than 500 or more than 50,000 characters.
  - Not identified as prevalently Dutch by the LangDetect package.
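The sentence-level and document-level rules above can be sketched in Python as follows. This is a minimal illustration and not the code used to build the dataset: the regex sentence splitter, the set of accepted end marks, and the blocked substrings are assumptions, and the bad-words and LangDetect steps are left out.

```python
import re
from typing import Optional

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 250
MIN_SENTENCES_PER_DOC = 5
MIN_DOC_CHARS = 500
MAX_DOC_CHARS = 50_000

# Assumption: which characters count as end-of-sentence punctuation.
END_MARKS = (".", "!", "?", '"')
# Assumption: a tiny illustrative subset of the blocked strings.
BLOCKED_SUBSTRINGS = ("{", "lorem ipsum")


def keep_sentence(sentence: str) -> bool:
    """Return True if the sentence passes all sentence-level rules."""
    s = sentence.strip()
    words = s.split()
    if len(words) < MIN_WORDS_PER_SENTENCE:
        return False
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        return False
    if not s.endswith(END_MARKS):
        return False
    lowered = s.lower()
    return not any(blocked in lowered for blocked in BLOCKED_SUBSTRINGS)


def clean_document(text: str) -> Optional[str]:
    """Filter sentences, then apply the document-level size rules.

    Returns the cleaned text, or None if the document is dropped.
    """
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s.strip() for s in sentences if keep_sentence(s)]
    if len(kept) < MIN_SENTENCES_PER_DOC:
        return None
    cleaned = " ".join(kept)
    if not (MIN_DOC_CHARS <= len(cleaned) <= MAX_DOC_CHARS):
        return None
    return cleaned
```

A real pipeline would use a proper sentence tokenizer and add the word-list and language-identification checks before keeping a document.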
Cleaning all the original Dutch shards of mC4 (1024 train shards of ~220MB, 4 validation shards of ~24MB) took roughly 10 hours using parallel processing with 96 CPU cores on a TPUv3 via Google Cloud, mainly because of the demanding sentence tokenization and language detection steps. The total size of the compressed .json.gz files is roughly halved after the procedure.
An example from the dataset:
{ 'timestamp': '2019-02-22T15:37:25Z', 'url': 'https://ondernemingen.bnpparibasfortis.be/nl/artikel?n=vijf-gouden-tips-voor-succesvol-zaken-doen-met-japan', 'text': 'Japanse bedrijven zijn niet alleen hondstrouw aan hun leveranciers , ze betalen ook nog eens erg stipt. Alleen is het niet zo makkelijk er een voet tussen de deur te krijgen. Met de volgende tips hebt u alvast een streepje voor.\nIn Japan draait alles om vertrouwen. Neem voldoende tijd om een relatie op te bouwen.Aarzel niet om tijdig een lokale vertrouwenspersoon in te schakelen.\nJapan is een erg competitieve markt.Kwaliteit en prijs zijn erg belangrijk, u zult dus het beste van uzelf moeten geven. Gelukkig is de beloning groot. Japanse zakenlui zijn loyaal en betalen stipt!\nJapanners houden er eigenzinnige eisen op na. Kom dus niet aanzetten met uw standaardproducten voor de Europese markt. Zo moet een producent van diepvriesfrieten bijvoorbeeld perfect identieke frietjes kunnen leveren in mini- verpakkingen. Het goede nieuws is dat Japanners voor kwaliteit graag diep in hun buidel tasten.\nEn u dacht dat Europa lijdt aan reglementitis? Japanners kennen er ook wat van. Tal van voorschriften zeggen wat je wel en niet mag doen. Gelukkig zijn de regels helder geformuleerd.\nHet gebruik van het Engels is niet echt ingeburgerd in Japan. Betrek een tolk bij uw onderhandelingen en zorg voor correcte vertalingen van handleidingen of softwareprogramma’s.' }
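The data contains the following fields:
- url: URL of the source document, as a string
- text: text content, as a string
- timestamp: timestamp of the crawl, as a string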
To build mC4, the original authors used CLD3 to identify over 100 languages. For Dutch, the whole corpus of scraped text was divided into 1028 jsonl files: 1024 for training, following the naming style c4-nl-cleaned.tfrecord-0XXXX-of-01024.json.gz, and 4 for validation, following the naming style c4-nl-cleaned.tfrecord-0000X-of-00004.json.gz. The full set of pre-processed files takes roughly 208GB of disk space to download with Git LFS.
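As an illustration of the naming scheme, the sketch below generates shard file names from the two patterns quoted above and reads one gzip-compressed JSON-lines shard from local disk. The helper names are hypothetical, and zero-based shard numbering is an assumption based on the usual tfrecord convention.

```python
import gzip
import json
from typing import Dict, Iterator, List

# Shard counts as stated in the text above.
SHARDS_PER_SPLIT = {"train": 1024, "validation": 4}


def shard_names(split: str) -> List[str]:
    """Build the file names for a split, assuming zero-based shard indices."""
    total = SHARDS_PER_SPLIT[split]
    return [
        f"c4-nl-cleaned.tfrecord-{index:05d}-of-{total:05d}.json.gz"
        for index in range(total)
    ]


def read_jsonl_gz(path: str) -> Iterator[Dict[str, str]]:
    """Yield one document (a dict) per non-empty line of a .json.gz shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Reading shards lazily with a generator keeps memory use flat, which matters when a single shard holds tens of thousands of documents.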
For ease of use under different storage capacities, the following incremental configs are available: (note: files on disk are compressed)
config | train size (docs, words, download + preproc disk space) | validation size |
---|---|---|
micro | 125k docs, 23M words (<1GB) | 16k docs |
tiny | 6M docs, 2B words (6 GB + 15 GB) | 16k docs |
small | 15M docs, 6B words (14 GB + 36 GB) | 16k docs |
medium | 31M docs, 12B words (28 GB + 72 GB) | 32k docs |
large | 47M docs, 19B words (42 GB + 108 GB) | 48k docs |
full | 64M docs, 25B words (58 GB + 148 GB) | 64k docs |
For each config above there also exists a config <name>_en_nl that interleaves nl and en examples from the cleaned en variant of C4.
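To make the storage trade-off concrete, here is a small sketch that picks the largest config whose total disk footprint (download plus preprocessing, taken from the table above) fits a given budget. The function names are hypothetical, the micro config is treated as needing 1 GB as an upper bound, and the table gives no sizes for the _en_nl variants, so those are only covered by the name helper.

```python
from typing import Optional

# Total disk space in GB (download + preprocessing), ordered small to large.
# Values come from the table above; "micro" uses 1 GB as an upper bound.
CONFIG_DISK_GB = {
    "micro": 1,
    "tiny": 6 + 15,
    "small": 14 + 36,
    "medium": 28 + 72,
    "large": 42 + 108,
    "full": 58 + 148,
}


def largest_config_that_fits(budget_gb: float) -> Optional[str]:
    """Return the largest config that fits in budget_gb, or None if none fits."""
    best = None
    for name, needed_gb in CONFIG_DISK_GB.items():
        if needed_gb <= budget_gb:
            best = name
    return best


def config_name(base: str, en_nl: bool = False) -> str:
    """Return the config name, optionally for the interleaved en-nl variant."""
    return f"{base}_en_nl" if en_nl else base
```

For example, with 60 GB of free disk space the helper selects small, since medium needs about 100 GB in total.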
You can load any config like this:
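    from datasets import load_dataset

    # Non-streaming load: downloads and prepares the whole config,
    # which is what produces a DatasetDict with row counts.
    datasets = load_dataset('yhavinga/mc4_nl_cleaned', 'tiny')
    print(datasets)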
This will print:

    DatasetDict({
        train: Dataset({
            features: ['text', 'timestamp', 'url'],
            num_rows: 6303893
        })
        validation: Dataset({
            features: ['text', 'timestamp', 'url'],
            num_rows: 16189
        })
    })
Since the configs are quite large, you may want to traverse them using the streaming mode available starting from Datasets v1.9.0:
    from datasets import load_dataset

    mc4_nl_full_stream = load_dataset('yhavinga/mc4_nl_cleaned', "full", split='train', streaming=True)
    print(next(iter(mc4_nl_full_stream)))  # Prints the example presented above
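When streaming, it is often handy to peek at only the first few examples instead of iterating over the whole split. The sketch below shows one way to do that with the standard library; take and word_count are hypothetical helper names, and the commented usage assumes network access and an installed datasets package.

```python
from itertools import islice
from typing import Dict, Iterable, List


def take(stream: Iterable[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Return the first n examples of a (possibly endless) stream, lazily."""
    return list(islice(stream, n))


def word_count(example: Dict[str, str]) -> int:
    """Count whitespace-separated words in the 'text' field of one example."""
    return len(example["text"].split())


# Usage with the real dataset (not executed here; needs network access):
#
#   from datasets import load_dataset
#   stream = load_dataset('yhavinga/mc4_nl_cleaned', 'micro',
#                         split='train', streaming=True)
#   for example in take(stream, 3):
#       print(example['url'], word_count(example))
```

Because islice stops after n items, only the shards needed for those first examples are fetched.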
Refer to the original paper for more considerations regarding the choice of sources and the scraping process for creating mC4.
With more than 151GB (58GB compressed) of cleaned Dutch text and more than 23B estimated words, this is by far the largest available cleaned corpus for the Dutch language. The second largest dataset available is OSCAR, which is only 39GB in size for its deduplicated variant and contains vulgarity. Training language models on this corpus with adequate computational resources should allow researchers to approach parity with the performance observed for the English language. This can in turn have important repercussions for the development of commercial language technology applications for the Dutch language.
Although the cleaning procedure aims to remove vulgarity and profanity, models trained on this scraped corpus will inevitably reflect biases present in blog articles and comments on the Internet. This makes the corpus especially interesting in the context of studying data biases and how to limit their impact.
AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
    @article{2019t5,
        author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
        title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
        journal = {arXiv e-prints},
        year = {2019},
        archivePrefix = {arXiv},
        eprint = {1910.10683},
    }
Thanks to gabriele.sarti996@gmail.com, @dirkgr and @lhoestq for providing the cleaned_it_mc4 example that shows how to upload a dataset to the Hugging Face Hub.