Dataset: mc4
Multilinguality: multilingual
Language creators: found
Annotations creators: no-annotation
Source datasets: original
Preprint: arxiv:1910.10683
License: ODC-BY
A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.
This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4
108 languages are available and are reported in the table below.
Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.
| language code | language name |
|---|---|
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bg-Latn | Bulgarian (Latin) |
| bn | Bangla |
| ca | Catalan |
| ceb | Cebuano |
| co | Corsican |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| el-Latn | Greek (Latin) |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fil | Filipino |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| haw | Hawaiian |
| hi | Hindi |
| hi-Latn | Hindi (Latin) |
| hmn | Hmong, Mong |
| ht | Haitian |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| iw | Hebrew (former ISO 639-1 code, now `he`) |
| ja | Japanese |
| ja-Latn | Japanese (Latin) |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kyrgyz |
| la | Latin |
| lb | Luxembourgish |
| lo | Lao |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mi | Maori |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| ny | Nyanja |
| pa | Punjabi |
| pl | Polish |
| ps | Pashto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| ru-Latn | Russian (Latin) |
| sd | Sindhi |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sm | Samoan |
| sn | Shona |
| so | Somali |
| sq | Albanian |
| sr | Serbian |
| st | Southern Sotho |
| su | Sundanese |
| sv | Swedish |
| sw | Swahili |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| und | Unknown language |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zh-Latn | Chinese (Latin) |
| zu | Zulu |
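The config names in the table follow a simple pattern: a base language code, optionally followed by `-Latn` for the romanized variant. As a minimal sketch of how that pattern could be handled in code (the helper name `split_config` is hypothetical and not part of any library):

```python
from typing import Optional, Tuple


def split_config(config: str) -> Tuple[str, Optional[str]]:
    """Split an mC4 config name into (language code, script suffix or None)."""
    # partition() splits on the first hyphen only; a config without a hyphen
    # yields an empty suffix, which is mapped to None here.
    language, _, script = config.partition("-")
    return language, (script or None)


for name in ["hi", "hi-Latn", "zh-Latn", "und"]:
    print(name, "->", split_config(name))
```

This may be handy for grouping a native-script subset with its romanized counterpart, for example `hi` with `hi-Latn`.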
You can load the mC4 subset of any language like this:
from datasets import load_dataset
en_mc4 = load_dataset("mc4", "en")
You can also specify a list of languages:
from datasets import load_dataset
mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
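Because the subsets are very large, it can be useful to peek at a few records without downloading a whole config. The sketch below is one possible way to do that, assuming the installed `datasets` version still provides the `mc4` loader used above and supports `streaming=True`. The helper names `first_texts` and `stream_mc4` are illustrative only.

```python
from itertools import islice
from typing import Dict, Iterable, List


def first_texts(records: Iterable[Dict[str, str]], n: int) -> List[str]:
    """Return the 'text' field of the first n records without reading further."""
    return [record["text"] for record in islice(records, n)]


def stream_mc4(language: str):
    """Open one mC4 config in streaming mode (requires network access)."""
    # Imported here so first_texts() can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the "mc4" loader from this card is available in the
    # installed version of `datasets`.
    return load_dataset("mc4", language, split="train", streaming=True)


# Usage (not run here, since it fetches data over the network):
#   print(first_texts(stream_mc4("af"), 3))
```

`first_texts` accepts any iterable of records, so it can be tried on a small in-memory list before pointing it at the real dataset.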
mC4 is mainly intended to pretrain language models and word representations.
The dataset supports 108 languages.
An example from the en config is:
{'timestamp': '2018-06-24T01:32:39Z',
'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
The data have three fields, as shown in the example above: `timestamp`, `text`, and `url`.
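As a rough illustration of working with a record, the sketch below checks that the three fields seen in the example (`timestamp`, `text`, `url`) are present and derives a few simple values from them. The function name `summarize_record` is hypothetical, the timestamp format is inferred from the single example above, and the example text is shortened.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

EXPECTED_FIELDS = {"timestamp", "text", "url"}


def summarize_record(record: dict) -> dict:
    """Check the fields of one record and derive a few simple values."""
    missing = EXPECTED_FIELDS - set(record)
    if missing:
        raise KeyError(f"record is missing fields: {sorted(missing)}")
    # Assumption: timestamps look like '2018-06-24T01:32:39Z' (UTC).
    crawled = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "crawled": crawled.replace(tzinfo=timezone.utc),
        "domain": urlparse(record["url"]).netloc,
        "num_lines": len(record["text"].split("\n")),
    }


# Shortened copy of the example record from the en config.
example = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}
print(summarize_record(example))
```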
To build mC4, the authors used CLD3 to identify over 100 languages. The resulting mC4 subsets for each language are reported in this table:
| config | train | validation |
|---|---|---|
| af | ? | ? |
| am | ? | ? |
| ar | ? | ? |
| az | ? | ? |
| be | ? | ? |
| bg | ? | ? |
| bg-Latn | ? | ? |
| bn | ? | ? |
| ca | ? | ? |
| ceb | ? | ? |
| co | ? | ? |
| cs | ? | ? |
| cy | ? | ? |
| da | ? | ? |
| de | ? | ? |
| el | ? | ? |
| el-Latn | ? | ? |
| en | ? | ? |
| eo | ? | ? |
| es | ? | ? |
| et | ? | ? |
| eu | ? | ? |
| fa | ? | ? |
| fi | ? | ? |
| fil | ? | ? |
| fr | ? | ? |
| fy | ? | ? |
| ga | ? | ? |
| gd | ? | ? |
| gl | ? | ? |
| gu | ? | ? |
| ha | ? | ? |
| haw | ? | ? |
| hi | ? | ? |
| hi-Latn | ? | ? |
| hmn | ? | ? |
| ht | ? | ? |
| hu | ? | ? |
| hy | ? | ? |
| id | ? | ? |
| ig | ? | ? |
| is | ? | ? |
| it | ? | ? |
| iw | ? | ? |
| ja | ? | ? |
| ja-Latn | ? | ? |
| jv | ? | ? |
| ka | ? | ? |
| kk | ? | ? |
| km | ? | ? |
| kn | ? | ? |
| ko | ? | ? |
| ku | ? | ? |
| ky | ? | ? |
| la | ? | ? |
| lb | ? | ? |
| lo | ? | ? |
| lt | ? | ? |
| lv | ? | ? |
| mg | ? | ? |
| mi | ? | ? |
| mk | ? | ? |
| ml | ? | ? |
| mn | ? | ? |
| mr | ? | ? |
| ms | ? | ? |
| mt | ? | ? |
| my | ? | ? |
| ne | ? | ? |
| nl | ? | ? |
| no | ? | ? |
| ny | ? | ? |
| pa | ? | ? |
| pl | ? | ? |
| ps | ? | ? |
| pt | ? | ? |
| ro | ? | ? |
| ru | ? | ? |
| ru-Latn | ? | ? |
| sd | ? | ? |
| si | ? | ? |
| sk | ? | ? |
| sl | ? | ? |
| sm | ? | ? |
| sn | ? | ? |
| so | ? | ? |
| sq | ? | ? |
| sr | ? | ? |
| st | ? | ? |
| su | ? | ? |
| sv | ? | ? |
| sw | ? | ? |
| ta | ? | ? |
| te | ? | ? |
| tg | ? | ? |
| th | ? | ? |
| tr | ? | ? |
| uk | ? | ? |
| und | ? | ? |
| ur | ? | ? |
| uz | ? | ? |
| vi | ? | ? |
| xh | ? | ? |
| yi | ? | ? |
| yo | ? | ? |
| zh | ? | ? |
| zh-Latn | ? | ? |
| zu | ? | ? |
Who are the source language producers?
[More Information Needed]
Who are the annotators?
[More Information Needed]
AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}