Datasets:

yhavinga/mc4_nl_cleaned

Tasks:

text-generation

Sub-tasks:

language-modeling

Languages:

Multilinguality:

monolingual en-nl

Language creators:

found

Annotations creators:

no-annotation

Source datasets:

extended

ArXiv:

arxiv:1910.10683

License:

odc-by

Dataset Card for Clean Dutch mC4

Dataset Summary

A cleaned version (151GB) of the Dutch part (277GB) of the C4 multilingual dataset (mC4). While this dataset is monolingual, it is also possible to download en-nl interleaved data; see the Data Configs section below. mC4 is based on the Common Crawl dataset. The original version was prepared by AllenAI and is hosted at https://huggingface.co/datasets/allenai/c4.

Preprocessing

The Dutch portion of mC4 was cleaned in a similar fashion to the English cleaned C4 version. See GitLab for details.

In summary, the preprocessing procedure includes:

Removing documents containing words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words.
Removing sentences containing:
- Fewer than 3 words.
- A word longer than 250 characters.
- An end symbol not matching end-of-sentence punctuation.
- Strings associated with JavaScript code (e.g. {), lorem ipsum, or policy information in Dutch or English.
Removing documents (after sentence filtering):
- Containing fewer than 5 sentences.
- Containing fewer than 500 or more than 50,000 characters.
- Not identified as prevalently Dutch by the LangDetect package.
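The sentence-level and document-level length filters listed above can be sketched roughly as follows. This is a minimal illustration, not the actual cleaning script: the names (`keep_sentence`, `clean_document`), the regex-based sentence splitter, the set of accepted end marks, and the short blocklist are all assumptions, and the bad-words and LangDetect steps are left out.

```python
import re
from typing import Optional

MIN_WORDS = 3
MAX_WORD_LENGTH = 250
MIN_SENTENCES = 5
MIN_CHARS, MAX_CHARS = 500, 50_000
# Assumption: which characters count as end-of-sentence punctuation.
END_MARKS = (".", "!", "?", '"')
# Assumption: a tiny illustrative subset of the blocked strings.
BLOCKED_SUBSTRINGS = ("{", "lorem ipsum")


def keep_sentence(sentence: str) -> bool:
    """Return True if the sentence passes the sentence-level filters."""
    words = sentence.split()
    if len(words) < MIN_WORDS:
        return False
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        return False
    if not sentence.rstrip().endswith(END_MARKS):
        return False
    lowered = sentence.lower()
    return not any(marker in lowered for marker in BLOCKED_SUBSTRINGS)


def clean_document(text: str) -> Optional[str]:
    """Filter sentences, then apply the document-level length checks."""
    # Assumption: a naive splitter; the real pipeline uses a proper tokenizer.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    kept = [s for s in sentences if keep_sentence(s)]
    if len(kept) < MIN_SENTENCES:
        return None
    cleaned = " ".join(kept)
    if not MIN_CHARS <= len(cleaned) <= MAX_CHARS:
        return None
    return cleaned
```

A document is dropped (`None`) when too few sentences survive or when the remaining text falls outside the character limits; otherwise the filtered text is returned.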

Cleaning all the original Dutch shards of mC4 (1024 of ~220MB train, 4 of ~24MB validation) took roughly 10 hours using parallel processing with 96 CPU cores on a TPUv3 via Google Cloud, mainly because sentence tokenization and language detection are demanding steps. The total size of the compressed .json.gz files is roughly halved by the procedure.

Dataset Structure

Data Instances

An example from the dataset:

{
  'timestamp': '2019-02-22T15:37:25Z',
  'url': 'https://ondernemingen.bnpparibasfortis.be/nl/artikel?n=vijf-gouden-tips-voor-succesvol-zaken-doen-met-japan',
  'text': 'Japanse bedrijven zijn niet alleen hondstrouw aan hun leveranciers , ze betalen ook nog eens erg stipt. Alleen is het niet zo makkelijk er een voet tussen de deur te krijgen. Met de volgende tips hebt u alvast een streepje voor.\nIn Japan draait alles om vertrouwen. Neem voldoende tijd om een relatie op te bouwen.Aarzel niet om tijdig een lokale vertrouwenspersoon in te schakelen.\nJapan is een erg competitieve markt.Kwaliteit en prijs zijn erg belangrijk, u zult dus het beste van uzelf moeten geven. Gelukkig is de beloning groot. Japanse zakenlui zijn loyaal en betalen stipt!\nJapanners houden er eigenzinnige eisen op na. Kom dus niet aanzetten met uw standaardproducten voor de Europese markt. Zo moet een producent van diepvriesfrieten bijvoorbeeld perfect identieke frietjes kunnen leveren in mini- verpakkingen. Het goede nieuws is dat Japanners voor kwaliteit graag diep in hun buidel tasten.\nEn u dacht dat Europa lijdt aan reglementitis? Japanners kennen er ook wat van. Tal van voorschriften zeggen wat je wel en niet mag doen. Gelukkig zijn de regels helder geformuleerd.\nHet gebruik van het Engels is niet echt ingeburgerd in Japan. Betrek een tolk bij uw onderhandelingen en zorg voor correcte vertalingen van handleidingen of softwareprogramma’s.'
}

Data Fields

The data contains the following fields:

url: URL of the source as a string
text: text content as a string
timestamp: timestamp of extraction as a string
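Because all three fields are plain strings, the timestamp has to be parsed before it can be compared or sorted. The sketch below shows one way to do this. The `Mc4Example` type and `parse_timestamp` helper are hypothetical names, and the format string assumes every timestamp looks like the one in the example instance above.

```python
from datetime import datetime, timezone
from typing import TypedDict


class Mc4Example(TypedDict):
    """Hypothetical type mirroring the three documented string fields."""
    url: str
    text: str
    timestamp: str


def parse_timestamp(example: Mc4Example) -> datetime:
    """Parse the ISO-like timestamp, assuming a trailing 'Z' means UTC."""
    parsed = datetime.strptime(example["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Placeholder record; only the timestamp is taken from the example above.
example: Mc4Example = {
    "url": "https://example.com/",
    "text": "placeholder text",
    "timestamp": "2019-02-22T15:37:25Z",
}
crawl_time = parse_timestamp(example)
```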

Data Configs

To build mC4, the original authors used CLD3 to identify over 100 languages. For Dutch, the whole corpus of scraped text was divided into 1028 jsonl files: 1024 for training, following the naming style c4-nl-cleaned.tfrecord-0XXXX-of-01024.json.gz, and 4 for validation, following the naming style c4-nl-cleaned.tfrecord-0000X-of-00004.json.gz. The full set of pre-processed files takes roughly 208GB of disk space to download with Git LFS.
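For scripts that need to address individual shards, the file names can be generated from the naming style described above. This is a sketch under two assumptions: shard indices are zero-based (as in the original C4 release), and both splits use exactly the prefix quoted in the text. The `shard_names` helper is hypothetical.

```python
from typing import List


def shard_names(num_shards: int) -> List[str]:
    """Build shard file names with five-digit, zero-padded indices."""
    return [
        f"c4-nl-cleaned.tfrecord-{index:05d}-of-{num_shards:05d}.json.gz"
        for index in range(num_shards)  # assumption: zero-based indices
    ]


train_files = shard_names(1024)
validation_files = shard_names(4)
```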

For ease of use under different storage capacities, the following incremental configs are available: (note: files on disk are compressed)

config	train size (docs, words, download + preproc disk space)	validation size
micro	125k docs, 23M words (<1GB)	16k docs
tiny	6M docs, 2B words (6 GB + 15 GB)	16k docs
small	15M docs, 6B words (14 GB + 36 GB)	16k docs
medium	31M docs, 12B words (28 GB + 72 GB)	32k docs
large	47M docs, 19B words (42 GB + 108 GB)	48k docs
full	64M docs, 25B words (58 GB + 148 GB)	64k docs
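To pick a config for a given amount of free disk space, the table can be turned into a small lookup. The sketch below assumes that the download and the preprocessed data must both fit on disk at the same time, and it treats the micro config as needing 1 GB in total, since the table only gives "<1GB". The helper name is hypothetical.

```python
from typing import Optional

# (download GB, preprocessed GB) per config, copied from the table above.
# Assumption: micro is rounded up to 1 GB total because the table says "<1GB".
CONFIG_DISK_GB = {
    "micro": (1, 0),
    "tiny": (6, 15),
    "small": (14, 36),
    "medium": (28, 72),
    "large": (42, 108),
    "full": (58, 148),
}


def largest_config_within(budget_gb: float) -> Optional[str]:
    """Return the largest config whose total disk need fits the budget."""
    best = None
    # The dict is ordered from smallest to largest config.
    for name, (download_gb, preproc_gb) in CONFIG_DISK_GB.items():
        if download_gb + preproc_gb <= budget_gb:
            best = name
    return best
```

With this table, a 60 GB budget selects the small config (14 GB + 36 GB = 50 GB), because medium would need 100 GB.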

For each config above there also exists a config <name>_en_nl that interleaves nl and en examples from the cleaned en variant of C4.
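The naming rule for the interleaved variants can be captured in a small helper. This is only an illustrative sketch; `config_name` and `SIZES` are hypothetical names, and the list of sizes is taken from the table above.

```python
SIZES = ("micro", "tiny", "small", "medium", "large", "full")


def config_name(size: str, interleave_english: bool = False) -> str:
    """Return the config name, adding the _en_nl suffix when requested."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    return f"{size}_en_nl" if interleave_english else size


dutch_only = config_name("tiny")
dutch_and_english = config_name("tiny", interleave_english=True)
```

The resulting string is what would be passed as the config name (the second argument) to `load_dataset`.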

You can load any config like this:

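from datasets import load_dataset

# Without streaming, the whole config is downloaded and prepared locally,
# which is what makes the row counts shown below available.
datasets = load_dataset('yhavinga/mc4_nl_cleaned', 'tiny')
print(datasets)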

This will print

DatasetDict({
    train: Dataset({
        features: ['text', 'timestamp', 'url'],
        num_rows: 6303893
    })
    validation: Dataset({
        features: ['text', 'timestamp', 'url'],
        num_rows: 16189
    })
})

Since the configs are quite large, you may want to traverse them using the streaming mode, available starting from Datasets v1.9.0:

from datasets import load_dataset

mc4_nl_full_stream = load_dataset('yhavinga/mc4_nl_cleaned', "full", split='train', streaming=True)
print(next(iter(mc4_nl_full_stream))) # Prints the example presented above
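When streaming, it is often useful to look at only the first few examples rather than iterating over the whole split. A minimal sketch using `itertools.islice` is shown below. The helper names are hypothetical, and the functions work on any iterable of example dictionaries, so they can be tried on a small in-memory list before pointing them at the real stream.

```python
from itertools import islice
from typing import Dict, Iterable, List


def take(stream: Iterable[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Materialize only the first n examples of a (possibly huge) stream."""
    return list(islice(stream, n))


def mean_word_count(examples: List[Dict[str, str]]) -> float:
    """Average number of whitespace-separated words in the 'text' field."""
    if not examples:
        return 0.0
    return sum(len(ex["text"].split()) for ex in examples) / len(examples)


# Stand-in for a streamed split; with the real data one would call
# take(mc4_nl_full_stream, 3) instead.
fake_stream = iter(
    {"text": "een twee drie " * k, "url": "", "timestamp": ""}
    for k in range(1, 1000)
)
first_three = take(fake_stream, 3)
```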

Dataset Creation

Refer to the original paper for more considerations regarding the choice of sources and the scraping process for creating mC4.

Considerations for Using the Data

Social Impact of Dataset

With more than 151GB (58GB compressed) of cleaned Dutch text and more than 23B estimated words, this is by far the largest available cleaned corpus for the Dutch language. The second largest dataset available is OSCAR, which is only 39GB in size for its deduplicated variant and contains vulgarity. Training language models on this corpus with adequate computational resources may help researchers approach the performance observed for the English language. This can in turn have important repercussions for the development of commercial language technology applications for the Dutch language.

Discussion of Biases

Although the cleaning procedure aims to remove vulgarity and profanity, models trained on this scraped corpus will inevitably reflect biases present in blog articles and comments on the Internet. This makes the corpus especially interesting in the context of studying data biases and how to limit their impact.

Additional Information

Licensing Information

AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

Citation Information

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}

Contributions

Thanks to gabriele.sarti996@gmail.com, @dirkgr and @lhoestq for providing the cleaned_it_mc4 example that shows how to upload a dataset to the Hugging Face Hub.

Author:

yhavinga

Dataset size:

2.83 GB