Dataset: mc4
Multilinguality: multilingual
Language creators: found
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1910.10683
License: odc-by

A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.
This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4
108 languages are available and are reported in the table below.
Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.
language code | language name |
---|---|
af | Afrikaans |
am | Amharic |
ar | Arabic |
az | Azerbaijani |
be | Belarusian |
bg | Bulgarian |
bg-Latn | Bulgarian (Latin) |
bn | Bangla |
ca | Catalan |
ceb | Cebuano |
co | Corsican |
cs | Czech |
cy | Welsh |
da | Danish |
de | German |
el | Greek |
el-Latn | Greek (Latin) |
en | English |
eo | Esperanto |
es | Spanish |
et | Estonian |
eu | Basque |
fa | Persian |
fi | Finnish |
fil | Filipino |
fr | French |
fy | Western Frisian |
ga | Irish |
gd | Scottish Gaelic |
gl | Galician |
gu | Gujarati |
ha | Hausa |
haw | Hawaiian |
hi | Hindi |
hi-Latn | Hindi (Latin) |
hmn | Hmong, Mong |
ht | Haitian |
hu | Hungarian |
hy | Armenian |
id | Indonesian |
ig | Igbo |
is | Icelandic |
it | Italian |
iw | Hebrew (former ISO 639-1 code) |
ja | Japanese |
ja-Latn | Japanese (Latin) |
jv | Javanese |
ka | Georgian |
kk | Kazakh |
km | Khmer |
kn | Kannada |
ko | Korean |
ku | Kurdish |
ky | Kyrgyz |
la | Latin |
lb | Luxembourgish |
lo | Lao |
lt | Lithuanian |
lv | Latvian |
mg | Malagasy |
mi | Maori |
mk | Macedonian |
ml | Malayalam |
mn | Mongolian |
mr | Marathi |
ms | Malay |
mt | Maltese |
my | Burmese |
ne | Nepali |
nl | Dutch |
no | Norwegian |
ny | Nyanja |
pa | Punjabi |
pl | Polish |
ps | Pashto |
pt | Portuguese |
ro | Romanian |
ru | Russian |
ru-Latn | Russian (Latin) |
sd | Sindhi |
si | Sinhala |
sk | Slovak |
sl | Slovenian |
sm | Samoan |
sn | Shona |
so | Somali |
sq | Albanian |
sr | Serbian |
st | Southern Sotho |
su | Sundanese |
sv | Swedish |
sw | Swahili |
ta | Tamil |
te | Telugu |
tg | Tajik |
th | Thai |
tr | Turkish |
uk | Ukrainian |
und | Unknown language |
ur | Urdu |
uz | Uzbek |
vi | Vietnamese |
xh | Xhosa |
yi | Yiddish |
yo | Yoruba |
zh | Chinese |
zh-Latn | Chinese (Latin) |
zu | Zulu |
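As a small illustration of the "-Latn" naming convention noted above, the following sketch separates romanized configs from the rest. The helper names `is_romanized` and `base_language` are hypothetical and not part of the `datasets` library, and the config list is only a short excerpt from the table.

```python
# Hypothetical helpers for illustration; not part of the datasets library.
ROMANIZED_SUFFIX = "-Latn"


def is_romanized(config: str) -> bool:
    """Return True for configs such as "ru-Latn" (Latin-script variants)."""
    return config.endswith(ROMANIZED_SUFFIX)


def base_language(config: str) -> str:
    """Strip the "-Latn" suffix if present, e.g. "ru-Latn" -> "ru"."""
    return config[: -len(ROMANIZED_SUFFIX)] if is_romanized(config) else config


# A short excerpt of the config names listed in the table above.
configs = ["bg", "bg-Latn", "el-Latn", "en", "ru-Latn", "zh"]

romanized = [c for c in configs if is_romanized(c)]
print(romanized)  # ['bg-Latn', 'el-Latn', 'ru-Latn']
print(sorted({base_language(c) for c in configs}))  # ['bg', 'el', 'en', 'ru', 'zh']
```

Checking the suffix rather than searching for the substring "Latn" keeps unrelated codes such as `la` (Latin) out of the romanized group.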
You can load the mC4 subset of any language like this:

    from datasets import load_dataset

    en_mc4 = load_dataset("mc4", "en")

You can also specify a list of languages:

    from datasets import load_dataset

    mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
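Since the corpus is described as colossal, it may be convenient to stream records instead of downloading a whole subset first. The sketch below assumes an installed `datasets` version in which the `mc4` loader used above is still available. `preview` and `stream_mc4` are hypothetical helper names, while `split` and `streaming=True` are standard `load_dataset` options.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def preview(records: Iterable[dict], n: int = 3) -> List[dict]:
    """Return the first n records of an iterable, reading no further than that."""
    return list(islice(records, n))


def stream_mc4(language: str = "en") -> Iterator[dict]:
    """Lazily iterate over the train split of one mC4 language config.

    Assumption: the "mc4" loader is available in the installed datasets version.
    """
    # Imported here so that preview() can be used without datasets installed.
    from datasets import load_dataset

    dataset = load_dataset("mc4", language, split="train", streaming=True)
    return iter(dataset)


# Usage (requires network access):
#     for record in preview(stream_mc4("en"), n=2):
#         print(record["url"])
```

Keeping `preview` independent of the loader means the same helper works on any iterable of records, including a small in-memory list used for testing.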
mC4 is mainly intended to pretrain language models and word representations.
The dataset supports 108 languages.
An example from the en config is:
{'timestamp': '2018-06-24T01:32:39Z', 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County', 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
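The data have several fields:
- timestamp: timestamp as a string
- text: text content as a string
- url: URL of the source page as a string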
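As a rough illustration of working with the `timestamp`, `text`, and `url` fields seen in the example record, the sketch below parses each one using only the Python standard library. The record is shortened from the example above, and the variable names are illustrative.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Record shortened from the example above; the text is truncated here for brevity.
record = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}

# The trailing "Z" marks UTC, so attach an explicit UTC timezone after parsing.
crawled_at = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ").replace(
    tzinfo=timezone.utc
)

# The host name is a convenient key for grouping or filtering documents by site.
domain = urlparse(record["url"]).netloc

# The text field is newline-separated; take its first line.
first_line = record["text"].split("\n", 1)[0]

print(crawled_at.year, domain, first_line)
# 2018 www.californialandcan.org Farm Resources in Plumas County
```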
To build mC4, the authors used CLD3 to identify over 100 languages. The resulting mC4 subsets for each language are reported in this table:
config | train | validation |
---|---|---|
af | ? | ? |
am | ? | ? |
ar | ? | ? |
az | ? | ? |
be | ? | ? |
bg | ? | ? |
bg-Latn | ? | ? |
bn | ? | ? |
ca | ? | ? |
ceb | ? | ? |
co | ? | ? |
cs | ? | ? |
cy | ? | ? |
da | ? | ? |
de | ? | ? |
el | ? | ? |
el-Latn | ? | ? |
en | ? | ? |
eo | ? | ? |
es | ? | ? |
et | ? | ? |
eu | ? | ? |
fa | ? | ? |
fi | ? | ? |
fil | ? | ? |
fr | ? | ? |
fy | ? | ? |
ga | ? | ? |
gd | ? | ? |
gl | ? | ? |
gu | ? | ? |
ha | ? | ? |
haw | ? | ? |
hi | ? | ? |
hi-Latn | ? | ? |
hmn | ? | ? |
ht | ? | ? |
hu | ? | ? |
hy | ? | ? |
id | ? | ? |
ig | ? | ? |
is | ? | ? |
it | ? | ? |
iw | ? | ? |
ja | ? | ? |
ja-Latn | ? | ? |
jv | ? | ? |
ka | ? | ? |
kk | ? | ? |
km | ? | ? |
kn | ? | ? |
ko | ? | ? |
ku | ? | ? |
ky | ? | ? |
la | ? | ? |
lb | ? | ? |
lo | ? | ? |
lt | ? | ? |
lv | ? | ? |
mg | ? | ? |
mi | ? | ? |
mk | ? | ? |
ml | ? | ? |
mn | ? | ? |
mr | ? | ? |
ms | ? | ? |
mt | ? | ? |
my | ? | ? |
ne | ? | ? |
nl | ? | ? |
no | ? | ? |
ny | ? | ? |
pa | ? | ? |
pl | ? | ? |
ps | ? | ? |
pt | ? | ? |
ro | ? | ? |
ru | ? | ? |
ru-Latn | ? | ? |
sd | ? | ? |
si | ? | ? |
sk | ? | ? |
sl | ? | ? |
sm | ? | ? |
sn | ? | ? |
so | ? | ? |
sq | ? | ? |
sr | ? | ? |
st | ? | ? |
su | ? | ? |
sv | ? | ? |
sw | ? | ? |
ta | ? | ? |
te | ? | ? |
tg | ? | ? |
th | ? | ? |
tr | ? | ? |
uk | ? | ? |
und | ? | ? |
ur | ? | ? |
uz | ? | ? |
vi | ? | ? |
xh | ? | ? |
yi | ? | ? |
yo | ? | ? |
zh | ? | ? |
zh-Latn | ? | ? |
zu | ? | ? |
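The idea of splitting a crawl into per-language subsets with a language identifier can be sketched as follows. This is not the authors' pipeline: `toy_detect` is a keyword-based stand-in for a real identifier such as CLD3, and the `min_confidence` parameter and its default value are illustrative assumptions that this card does not state.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# A detector maps text to (language_code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def bucket_by_language(
    texts: Iterable[str], detect: Detector, min_confidence: float = 0.7
) -> Dict[str, List[str]]:
    """Group texts by detected language, skipping low-confidence detections."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        code, confidence = detect(text)
        if confidence >= min_confidence:
            buckets[code].append(text)
    return dict(buckets)


def toy_detect(text: str) -> Tuple[str, float]:
    """Toy stand-in for a real language identifier; keyword-based only."""
    lowered = text.lower()
    if "bonjour" in lowered:
        return "fr", 0.99
    if "hello" in lowered:
        return "en", 0.95
    return "und", 0.10


pages = ["Hello world", "Bonjour le monde", "zzz qqq", "hello again"]
print(bucket_by_language(pages, toy_detect))
# {'en': ['Hello world', 'hello again'], 'fr': ['Bonjour le monde']}
```

Passing the detector in as a function keeps the grouping logic independent of any particular language-identification library.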
AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
    @article{2019t5,
      author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
      title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
      journal = {arXiv e-prints},
      year = {2019},
      archivePrefix = {arXiv},
      eprint = {1910.10683},
    }