Dataset Card for mC4

Dataset Summary

A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.

This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4

108 languages are available and are reported in the table below.

Note that language codes ending in "-Latn" are romanized variants, i.e. text in that language written using the Latin script.

language code language name
af Afrikaans
am Amharic
ar Arabic
az Azerbaijani
be Belarusian
bg Bulgarian
bg-Latn Bulgarian (Latin)
bn Bangla
ca Catalan
ceb Cebuano
co Corsican
cs Czech
cy Welsh
da Danish
de German
el Greek
el-Latn Greek (Latin)
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
fa Persian
fi Finnish
fil Filipino
fr French
fy Western Frisian
ga Irish
gd Scottish Gaelic
gl Galician
gu Gujarati
ha Hausa
haw Hawaiian
hi Hindi
hi-Latn Hindi (Latin)
hmn Hmong, Mong
ht Haitian
hu Hungarian
hy Armenian
id Indonesian
ig Igbo
is Icelandic
it Italian
iw Hebrew (former language code)
ja Japanese
ja-Latn Japanese (Latin)
jv Javanese
ka Georgian
kk Kazakh
km Khmer
kn Kannada
ko Korean
ku Kurdish
ky Kyrgyz
la Latin
lb Luxembourgish
lo Lao
lt Lithuanian
lv Latvian
mg Malagasy
mi Maori
mk Macedonian
ml Malayalam
mn Mongolian
mr Marathi
ms Malay
mt Maltese
my Burmese
ne Nepali
nl Dutch
no Norwegian
ny Nyanja
pa Punjabi
pl Polish
ps Pashto
pt Portuguese
ro Romanian
ru Russian
ru-Latn Russian (Latin)
sd Sindhi
si Sinhala
sk Slovak
sl Slovenian
sm Samoan
sn Shona
so Somali
sq Albanian
sr Serbian
st Southern Sotho
su Sundanese
sv Swedish
sw Swahili
ta Tamil
te Telugu
tg Tajik
th Thai
tr Turkish
uk Ukrainian
und Unknown language
ur Urdu
uz Uzbek
vi Vietnamese
xh Xhosa
yi Yiddish
yo Yoruba
zh Chinese
zh-Latn Chinese (Latin)
zu Zulu
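
As a minimal sketch of the "-Latn" naming convention described above, the helper below separates a config name into its base language code and a flag saying whether it is a romanized variant. The helper name `split_config` and the short sample list are illustrative assumptions; they are not part of the `datasets` library.

```python
from typing import Tuple

# Suffix that marks a romanized (Latin-script) variant in the table above.
LATN_SUFFIX = "-Latn"


def split_config(config: str) -> Tuple[str, bool]:
    """Return (base_language_code, is_romanized) for an mC4 config name.

    Hypothetical helper, written only to illustrate the naming convention.
    """
    if config.endswith(LATN_SUFFIX):
        return config[: -len(LATN_SUFFIX)], True
    return config, False


# A few config names copied from the language table.
sample_configs = ["bg", "bg-Latn", "en", "zh", "zh-Latn"]

# Keep only the romanized variants.
romanized = [c for c in sample_configs if split_config(c)[1]]
print(romanized)
```

A helper like this can be handy when a romanized subset should be grouped with, or excluded alongside, its base language.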

You can load the mC4 subset of any language like this:

from datasets import load_dataset

en_mc4 = load_dataset("mc4", "en")

You can also specify a list of languages:

from datasets import load_dataset

mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])

Supported Tasks and Leaderboards

mC4 is mainly intended to pretrain language models and word representations.

Languages

The dataset supports 108 languages.

Dataset Structure

Data Instances

An example from the en config is:

{'timestamp': '2018-06-24T01:32:39Z',
 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}

Data Fields

Each record has three fields:

  • url : URL of the source page, as a string
  • text : text content, as a string
  • timestamp : crawl timestamp, as a string
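
A minimal sketch of working with these fields, using only the Python standard library: it parses the `timestamp` string into a timezone-aware `datetime` and extracts the host name from `url`. The helper name `parse_record` and the shortened `text` value are assumptions made for illustration, and the timestamp format is inferred from the single example instance above, so other records may need more lenient parsing.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Format inferred from the example instance ('2018-06-24T01:32:39Z').
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_record(record: dict) -> dict:
    """Derive a few typed values from one mC4-style record (hypothetical helper)."""
    crawled_at = datetime.strptime(record["timestamp"], TIMESTAMP_FORMAT)
    # The trailing 'Z' denotes UTC, so attach the UTC timezone explicitly.
    crawled_at = crawled_at.replace(tzinfo=timezone.utc)
    return {
        "host": urlparse(record["url"]).netloc,
        "crawled_at": crawled_at,
        "num_chars": len(record["text"]),
    }


# Record shaped like the example instance, with the text shortened.
example = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}

parsed = parse_record(example)
print(parsed["host"], parsed["crawled_at"].isoformat(), parsed["num_chars"])
```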

Data Splits

To build mC4, the authors used CLD3 to identify over 100 languages. The table below lists the resulting per-language subsets; split sizes are not reported and are shown as "?".

config train validation
af ? ?
am ? ?
ar ? ?
az ? ?
be ? ?
bg ? ?
bg-Latn ? ?
bn ? ?
ca ? ?
ceb ? ?
co ? ?
cs ? ?
cy ? ?
da ? ?
de ? ?
el ? ?
el-Latn ? ?
en ? ?
eo ? ?
es ? ?
et ? ?
eu ? ?
fa ? ?
fi ? ?
fil ? ?
fr ? ?
fy ? ?
ga ? ?
gd ? ?
gl ? ?
gu ? ?
ha ? ?
haw ? ?
hi ? ?
hi-Latn ? ?
hmn ? ?
ht ? ?
hu ? ?
hy ? ?
id ? ?
ig ? ?
is ? ?
it ? ?
iw ? ?
ja ? ?
ja-Latn ? ?
jv ? ?
ka ? ?
kk ? ?
km ? ?
kn ? ?
ko ? ?
ku ? ?
ky ? ?
la ? ?
lb ? ?
lo ? ?
lt ? ?
lv ? ?
mg ? ?
mi ? ?
mk ? ?
ml ? ?
mn ? ?
mr ? ?
ms ? ?
mt ? ?
my ? ?
ne ? ?
nl ? ?
no ? ?
ny ? ?
pa ? ?
pl ? ?
ps ? ?
pt ? ?
ro ? ?
ru ? ?
ru-Latn ? ?
sd ? ?
si ? ?
sk ? ?
sl ? ?
sm ? ?
sn ? ?
so ? ?
sq ? ?
sr ? ?
st ? ?
su ? ?
sv ? ?
sw ? ?
ta ? ?
te ? ?
tg ? ?
th ? ?
tr ? ?
uk ? ?
und ? ?
ur ? ?
uz ? ?
vi ? ?
xh ? ?
yi ? ?
yo ? ?
zh ? ?
zh-Latn ? ?
zu ? ?

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

AllenAI is releasing this dataset under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

Citation Information

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}

Contributions

Thanks to @dirkgr and @lhoestq for adding this dataset.