Dataset:
Muennighoff/xP3x
xP3x (Crosslingual Public Pool of Prompts eXtended) is a collection of prompts & datasets across 277 languages & 16 NLP tasks. It contains all of xP3 + much more! It is used for training future contenders of mT0 & BLOOMZ at Project Aya @ C4AI.
Name | Explanation | Example models |
---|---|---|
xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help! |
xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl |
xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt |
xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz |
P3 | Re-preprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3 |
An example looks as follows:
    {
        'inputs': '11月、遂にクロームはファイヤーフォックスを引き離し始めた。_はインターネットユーザーの評価が高まったのだ。\nReplace the _ in the above sentence with the correct option: \n- ファイヤーフォックス\n- クローム',
        'targets': 'クローム',
        'language': 'jpn_Jpan',
        'split': 'test',
        'template': 'Replace',
        'dataset': 'Muennighoff/xwinograd',
        'config': 'jp'
    }
The data fields are the same among all splits. As in the example above, they are inputs, targets, language, split, template, dataset and config.
The dataset is about 680 gigabytes in size and contains about 530 million samples. You may want to filter it and then deduplicate it, depending on your needs.
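One way to deduplicate after filtering is to drop exact repeats of the same input and target text. The sketch below is only an illustration, not the method used to build xP3x: the `dedup` helper and the toy records are hypothetical, and it assumes each sample is a dict with `inputs` and `targets` strings, as in the example above.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def dedup(samples: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each sample once, keyed on its exact (inputs, targets) text."""
    seen = set()
    for sample in samples:
        # Hash the pair so the set stores fixed-size digests, not full prompts.
        # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        text = sample["inputs"] + "\x00" + sample["targets"]
        key = hashlib.sha256(text.encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield sample


# Toy records, made up for illustration
toy = [
    {"inputs": "2 + 2 =", "targets": "4"},
    {"inputs": "2 + 2 =", "targets": "4"},  # exact duplicate
    {"inputs": "2 + 2 =", "targets": "four"},
]
unique = list(dedup(toy))
print(len(unique))  # 2
```

The set of digests grows with the number of unique samples kept, so this approach is better suited to a filtered subset than to the full collection.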
Loading by language:
    # pip install -q datasets
    from datasets import load_dataset

    # Use streaming to avoid downloading everything at once
    ds = load_dataset("Muennighoff/xP3x", "zho_Hans", streaming=True)
    for x in ds["train"]:
        print(x)
        break
You can then filter by the data fields, e.g. to get only certain configs or datasets. Because every dataset-config-template combination is its own jsonl file, you can also choose the datasets, configs and templates you want and download only those. For example, to download all Japanese xwinograd samples, you could do:
    # pip install -q datasets
    # pip install --upgrade huggingface-hub
    from datasets import load_dataset
    from huggingface_hub import HfFileSystem, hf_hub_url

    fs = HfFileSystem()
    # Match only the Japanese xwinograd jsonl files
    fps = fs.glob("datasets/Muennighoff/xP3x/data/jpn_Jpan/*xwinograd*")
    resolved_paths = [fs.resolve_path(file) for file in fps]
    # Turn each repository path into a downloadable URL
    data_files = [
        hf_hub_url(
            resolved_path.repo_id,
            resolved_path.path_in_repo,
            repo_type=resolved_path.repo_type,
        )
        for resolved_path in resolved_paths
    ]
    ds = load_dataset("json", data_files=data_files, num_proc=8)["train"]
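Filtering by data fields, as opposed to selecting files, can be sketched in plain Python. The `select` helper and the toy records below are hypothetical and exist only for illustration; the sketch assumes each sample is a dict with the fields shown in the example sample.

```python
from typing import Any, Dict, Iterable, Iterator


def select(samples: Iterable[Dict[str, Any]], **wanted: Any) -> Iterator[Dict[str, Any]]:
    """Yield samples whose fields equal every keyword argument given."""
    for sample in samples:
        if all(sample.get(field) == value for field, value in wanted.items()):
            yield sample


# Toy records with made-up values, shaped like the example sample
toy = [
    {"inputs": "a", "targets": "b", "dataset": "Muennighoff/xwinograd", "config": "jp", "split": "test"},
    {"inputs": "c", "targets": "d", "dataset": "Muennighoff/xwinograd", "config": "en", "split": "test"},
    {"inputs": "e", "targets": "f", "dataset": "some/other-dataset", "config": "jp", "split": "train"},
]
picked = list(select(toy, dataset="Muennighoff/xwinograd", config="jp"))
print(len(picked))  # 1
```

Because the helper is a generator, it should also work lazily on a streamed split such as `ds["train"]` from the loading example, assuming the stream yields one dict per sample.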
Language | Code | Kilobytes | % of kilobytes | Samples | % of samples |
---|---|---|---|---|---|
Emilian | egl_Latn | 104 | 0.0 | 402 | 0.0 |
Swiss German | gsw_Latn | 104 | 0.0 | 408 | 0.0 |
Novial | nov_Latn | 116 | 0.0 | 432 | 0.0 |
Ainu (Latin script) | ain_Latn | 120 | 0.0 | 410 | 0.0 |
Chamorro | cha_Latn | 120 | 0.0 | 452 | 0.0 |
Gothic | got_Goth | 120 | 0.0 | 402 | 0.0 |
Prussian | prg_Latn | 120 | 0.0 | 424 | 0.0 |
Picard | pcd_Latn | 140 | 0.0 | 530 | 0.0 |
Northern Frisian | frr_Latn | 156 | 0.0 | 554 | 0.0 |
Uzbek (Latin script) | uzb_Latn | 156 | 0.0 | 600 | 0.0 |
Ottoman Turkish (Latin script) | ota_Latn | 188 | 0.0 | 632 | 0.0 |
Swahili (macrolanguage) | swa_Latn | 212 | 0.0 | 772 | 0.0 |
Talossan | tzl_Latn | 220 | 0.0 | 836 | 0.0 |
Kven Finnish | fkv_Latn | 260 | 0.0 | 910 | 0.0 |
Zaza | zza_Latn | 260 | 0.0 | 1,056 | 0.0 |
Frisian | fry_Latn | 268 | 0.0 | 956 | 0.0 |
Piemontese | pms_Latn | 276 | 0.0 | 998 | 0.0 |
Kalmyk | xal_Cyrl | 288 | 0.0 | 976 | 0.0 |
Hunsrik | hrx_Latn | 352 | 0.0 | 1,380 | 0.0 |
Romany | rom_Latn | 364 | 0.0 | 1,410 | 0.0 |
Ancient Greek (to 1453) | grc_Grek | 392 | 0.0 | 1,226 | 0.0 |
Tase Naga | nst_Latn | 424 | 0.0 | 1,608 | 0.0 |
Albanian | sqi_Latn | 596 | 0.0 | 2,216 | 0.0 |
Guadeloupean Creole French | gcf_Latn | 608 | 0.0 | 2,326 | 0.0 |
Yakut | sah_Cyrl | 608 | 0.0 | 1,986 | 0.0 |
Ho (Latin script) | hoc_Latn | 632 | 0.0 | 2,634 | 0.0 |
Khasi | kha_Latn | 676 | 0.0 | 2,664 | 0.0 |
Algerian Arabic | arq_Arab | 688 | 0.0 | 2,278 | 0.0 |
Lower Sorbian | dsb_Latn | 692 | 0.0 | 2,596 | 0.0 |
Chuvash | chv_Cyrl | 716 | 0.0 | 2,446 | 0.0 |
Old Russian | orv_Cyrl | 752 | 0.0 | 2,586 | 0.0 |
Pampanga | pam_Latn | 784 | 0.0 | 2,984 | 0.0 |
Kurdish (Latin script) | kur_Latn | 796 | 0.0 | 3,050 | 0.0 |
Ottoman Turkish | ota_Arab | 832 | 0.0 | 2,772 | 0.0 |
Kotava | avk_Latn | 864 | 0.0 | 3,118 | 0.0 |
Upper Sorbian | hsb_Latn | 900 | 0.0 | 3,474 | 0.0 |
Buryat | bua_Cyrl | 924 | 0.0 | 3,218 | 0.0 |
Swabian | swg_Latn | 996 | 0.0 | 3,366 | 0.0 |
Coastal Kadazan | kzj_Latn | 1,136 | 0.0 | 3,766 | 0.0 |
Chavacano | cbk_Latn | 1,352 | 0.0 | 4,994 | 0.0 |
Quechua | que_Latn | 1,704 | 0.0 | 5,312 | 0.0 |
Lingua Franca Nova (Cyrillic script) | lfn_Cyrl | 1,740 | 0.0 | 5,458 | 0.0 |
Gronings | gos_Latn | 1,864 | 0.0 | 7,462 | 0.0 |
Volapük | vol_Latn | 1,948 | 0.0 | 7,712 | 0.0 |
Yue Chinese (Simplified) | yue_Hans | 2,300 | 0.0 | 7,872 | 0.0 |
Mari (Russia) | chm_Cyrl | 2,540 | 0.0 | 7,496 | 0.0 |
Kadazan Dusun | dtp_Latn | 2,548 | 0.0 | 8,892 | 0.0 |
Breton | bre_Latn | 3,048 | 0.0 | 11,868 | 0.0 |
Ladino | lad_Latn | 3,224 | 0.0 | 11,916 | 0.0 |
Cornish | cor_Latn | 3,492 | 0.0 | 13,880 | 0.0 |
Interlingue | ile_Latn | 3,700 | 0.0 | 14,468 | 0.0 |
Wu Chinese | wuu_Hans | 3,784 | 0.0 | 13,062 | 0.0 |
Japanese (Katakana) | jpn_Kana | 4,208 | 0.0 | 13,942 | 0.0 |
Ido | ido_Latn | 6,180 | 0.0 | 23,742 | 0.0 |
Yiddish | yid_Hebr | 9,896 | 0.0 | 34,412 | 0.01 |
Klingon | tlh_Latn | 11,716 | 0.0 | 46,010 | 0.01 |
Lingua Franca Nova | lfn_Latn | 13,328 | 0.0 | 46,826 | 0.01 |
Lojban | jbo_Latn | 17,468 | 0.0 | 66,694 | 0.01 |
Low German | nds_Latn | 18,364 | 0.0 | 68,098 | 0.01 |
Interlingua (International Auxiliary Language Association) | ina_Latn | 25,700 | 0.0 | 76,584 | 0.01 |
Java | java | 25,904 | 0.0 | 13,551 | 0.0 |
Japanese (Kanji) | jpn_Hani | 26,292 | 0.0 | 89,978 | 0.02 |
Norwegian | nor_Latn | 26,724 | 0.0 | 93,116 | 0.02 |
Toki Pona | toki_Latn | 26,808 | 0.0 | 97,170 | 0.02 |
Latin | lat_Latn | 28,900 | 0.0 | 101,390 | 0.02 |
Serbo-Croatian | hbs_Latn | 29,452 | 0.0 | 105,748 | 0.02 |
Nigerian Pidgin | pcm_Latn | 145,872 | 0.02 | 88,992 | 0.02 |
Azerbaijani (South or North; Latin script) | aze_Latn | 147,564 | 0.02 | 77,875 | 0.01 |
Serbian (Latin script) | srp_Latn | 179,072 | 0.03 | 131,101 | 0.02 |
Japanese (Hiragana) | jpn_Hira | 188,944 | 0.03 | 628,758 | 0.12 |
Berber (Latin script) | ber_Latn | 201,464 | 0.03 | 693,602 | 0.13 |
Jupyter Notebook | jupyter_notebook | 416,056 | 0.06 | 400,000 | 0.08 |
Yue Chinese | yue_Hant | 613,352 | 0.09 | 1,227,429 | 0.23 |
Haitian Creole | hat_Latn | 629,420 | 0.09 | 1,228,281 | 0.23 |
Mossi | mos_Latn | 630,416 | 0.09 | 1,223,481 | 0.23 |
Pangasinan | pag_Latn | 630,684 | 0.09 | 1,223,481 | 0.23 |
Twi | twi_Latn | 631,172 | 0.09 | 1,223,481 | 0.23 |
Bosnian | bos_Latn | 633,016 | 0.09 | 1,224,479 | 0.23 |
Ewe | ewe_Latn | 633,292 | 0.09 | 1,223,481 | 0.23 |
Bambara | bam_Latn | 634,520 | 0.09 | 1,223,481 | 0.23 |
Javanese | jav_Latn | 635,248 | 0.09 | 1,224,003 | 0.23 |
Southwestern Dinka | dik_Latn | 635,416 | 0.09 | 1,223,481 | 0.23 |
Kabuverdianu | kea_Latn | 636,144 | 0.09 | 1,223,481 | 0.23 |
Dyula | dyu_Latn | 636,464 | 0.09 | 1,223,481 | 0.23 |
Venetian | vec_Latn | 637,412 | 0.09 | 1,223,481 | 0.23 |
Chokwe | cjk_Latn | 637,532 | 0.09 | 1,223,481 | 0.23 |
Latgalian | ltg_Latn | 637,612 | 0.09 | 1,223,481 | 0.23 |
Sundanese | sun_Latn | 638,120 | 0.09 | 1,223,481 | 0.23 |
Asturian | ast_Latn | 638,708 | 0.09 | 1,223,481 | 0.23 |
Akan | aka_Latn | 639,648 | 0.09 | 1,223,481 | 0.23 |
Mizo | lus_Latn | 639,680 | 0.09 | 1,223,481 | 0.23 |
Guarani | grn_Latn | 641,540 | 0.09 | 1,225,647 | 0.23 |
Limburgish | lim_Latn | 642,368 | 0.09 | 1,223,481 | 0.23 |
Faroese | fao_Latn | 642,432 | 0.09 | 1,224,067 | 0.23 |
Buginese | bug_Latn | 643,472 | 0.09 | 1,223,481 | 0.23 |
Sango | sag_Latn | 643,596 | 0.09 | 1,223,481 | 0.23 |
Luba-Kasai | lua_Latn | 643,640 | 0.09 | 1,223,481 | 0.23 |
Papiamento | pap_Latn | 643,648 | 0.09 | 1,223,481 | 0.23 |
Silesian | szl_Latn | 644,608 | 0.09 | 1,223,481 | 0.23 |
Sicilian | scn_Latn | 645,636 | 0.1 | 1,223,481 | 0.23 |
Kimbundu | kmb_Latn | 645,964 | 0.1 | 1,223,481 | 0.23 |
Basque | eus_Latn | 646,084 | 0.1 | 1,246,877 | 0.23 |
Balinese | ban_Latn | 646,408 | 0.1 | 1,223,481 | 0.23 |
Norwegian Nynorsk | nno_Latn | 646,996 | 0.1 | 1,229,699 | 0.23 |
Central Aymara | ayr_Latn | 647,236 | 0.1 | 1,223,481 | 0.23 |
Tamasheq (Latin script) | taq_Latn | 648,656 | 0.1 | 1,223,481 | 0.23 |
Kikongo | kon_Latn | 648,992 | 0.1 | 1,223,481 | 0.23 |
Friulian | fur_Latn | 649,272 | 0.1 | 1,223,481 | 0.23 |
Ayacucho Quechua | quy_Latn | 649,992 | 0.1 | 1,223,481 | 0.23 |
Maori | mri_Latn | 650,336 | 0.1 | 1,224,211 | 0.23 |
Icelandic | isl_Latn | 650,372 | 0.1 | 1,246,623 | 0.23 |
Galician | glg_Latn | 652,088 | 0.1 | 1,233,291 | 0.23 |
Catalan | cat_Latn | 652,116 | 0.1 | 1,241,381 | 0.23 |
Lombard | lmo_Latn | 652,120 | 0.1 | 1,223,481 | 0.23 |
Banjar (Latin script) | bjn_Latn | 652,372 | 0.1 | 1,223,481 | 0.23 |
Fijian | fij_Latn | 652,796 | 0.1 | 1,223,481 | 0.23 |
Crimean Tatar | crh_Latn | 653,920 | 0.1 | 1,223,895 | 0.23 |
Northern Kurdish | kmr_Latn | 654,108 | 0.1 | 1,223,481 | 0.23 |
Ligurian | lij_Latn | 654,432 | 0.1 | 1,223,481 | 0.23 |
Occitan | oci_Latn | 655,676 | 0.1 | 1,227,945 | 0.23 |
Turkmen | tuk_Latn | 658,672 | 0.1 | 1,241,205 | 0.23 |
Luxembourgish | ltz_Latn | 658,768 | 0.1 | 1,225,339 | 0.23 |
Cebuano | ceb_Latn | 659,124 | 0.1 | 1,226,039 | 0.23 |
Samoan | smo_Latn | 659,704 | 0.1 | 1,223,481 | 0.23 |
Sardinian | srd_Latn | 660,000 | 0.1 | 1,223,481 | 0.23 |
Bemba | bem_Latn | 660,504 | 0.1 | 1,223,481 | 0.23 |
Minangkabau (Latin script) | min_Latn | 660,672 | 0.1 | 1,223,481 | 0.23 |
Acehnese (Latin script) | ace_Latn | 661,084 | 0.1 | 1,223,481 | 0.23 |
Ilocano | ilo_Latn | 661,184 | 0.1 | 1,227,663 | 0.23 |
Irish | gle_Latn | 661,660 | 0.1 | 1,227,357 | 0.23 |
Fon | fon_Latn | 663,124 | 0.1 | 1,223,481 | 0.23 |
Waray | war_Latn | 664,120 | 0.1 | 1,226,503 | 0.23 |
Norwegian Bokmål | nob_Latn | 666,240 | 0.1 | 1,300,607 | 0.24 |
Tosk Albanian | als_Latn | 666,692 | 0.1 | 1,223,481 | 0.23 |
Standard Malay | zsm_Latn | 667,088 | 0.1 | 1,270,715 | 0.24 |
Southern Sotho | sot_Latn | 667,728 | 0.1 | 1,223,481 | 0.23 |
Kabyle | kab_Latn | 668,128 | 0.1 | 1,346,605 | 0.25 |
Jingpho | kac_Latn | 669,464 | 0.1 | 1,223,481 | 0.23 |
Lingala | lin_Latn | 670,428 | 0.1 | 1,323,481 | 0.25 |
Wolof | wol_Latn | 670,568 | 0.1 | 1,373,481 | 0.26 |
Central Kanuri (Latin script) | knc_Latn | 670,800 | 0.1 | 1,223,481 | 0.23 |
Kikuyu | kik_Latn | 672,096 | 0.1 | 1,223,481 | 0.23 |
Tok Pisin | tpi_Latn | 672,916 | 0.1 | 1,223,481 | 0.23 |
Nuer | nus_Latn | 673,632 | 0.1 | 1,223,481 | 0.23 |
Tagalog | tgl_Latn | 673,684 | 0.1 | 1,247,417 | 0.23 |
Tumbuka | tum_Latn | 676,948 | 0.1 | 1,223,481 | 0.23 |
Plateau Malagasy | plt_Latn | 677,852 | 0.1 | 1,223,481 | 0.23 |
Afrikaans | afr_Latn | 679,164 | 0.1 | 1,337,091 | 0.25 |
North Azerbaijani | azj_Latn | 679,820 | 0.1 | 1,223,481 | 0.23 |
Kabiyè | kbp_Latn | 684,880 | 0.1 | 1,223,481 | 0.23 |
Modern Standard Arabic (Romanized) | arb_Latn | 685,408 | 0.1 | 1,223,481 | 0.23 |
Scottish Gaelic | gla_Latn | 708,620 | 0.1 | 1,243,627 | 0.23 |
Sindhi | snd_Arab | 718,680 | 0.11 | 1,223,481 | 0.23 |
North Levantine Arabic | apc_Arab | 720,048 | 0.11 | 1,223,481 | 0.23 |
Tunisian Arabic | aeb_Arab | 720,360 | 0.11 | 1,223,481 | 0.23 |
South Levantine Arabic | ajp_Arab | 720,488 | 0.11 | 1,223,481 | 0.23 |
Dari | prs_Arab | 720,500 | 0.11 | 1,223,481 | 0.23 |
Moroccan Arabic | ary_Arab | 722,904 | 0.11 | 1,223,481 | 0.23 |
Egyptian Arabic | arz_Arab | 723,356 | 0.11 | 1,223,481 | 0.23 |
Najdi Arabic | ars_Arab | 725,784 | 0.11 | 1,223,481 | 0.23 |
Acehnese (Arabic script) | ace_Arab | 726,272 | 0.11 | 1,223,481 | 0.23 |
Mesopotamian Arabic | acm_Arab | 728,472 | 0.11 | 1,223,481 | 0.23 |
Ta’izzi-Adeni Arabic | acq_Arab | 734,780 | 0.11 | 1,223,481 | 0.23 |
South Azerbaijani | azb_Arab | 735,728 | 0.11 | 1,223,481 | 0.23 |
Central Kanuri (Arabic script) | knc_Arab | 746,936 | 0.11 | 1,223,481 | 0.23 |
Rundi | run_Latn | 749,792 | 0.11 | 1,296,111 | 0.24 |
Banjar (Arabic script) | bjn_Arab | 751,112 | 0.11 | 1,223,481 | 0.23 |
Central Kurdish | ckb_Arab | 756,804 | 0.11 | 1,223,481 | 0.23 |
Bashkir | bak_Cyrl | 758,816 | 0.11 | 1,223,481 | 0.23 |
Kashmiri (Arabic script) | kas_Arab | 759,140 | 0.11 | 1,223,481 | 0.23 |
Tatar | tat_Cyrl | 764,212 | 0.11 | 1,247,685 | 0.23 |
Minangkabau (Arabic script) | min_Arab | 765,384 | 0.11 | 1,223,481 | 0.23 |
Kazakh | kaz_Cyrl | 766,176 | 0.11 | 1,232,697 | 0.23 |
Halh Mongolian | khk_Cyrl | 776,384 | 0.11 | 1,224,353 | 0.23 |
Tajik | tgk_Cyrl | 780,452 | 0.11 | 1,223,481 | 0.23 |
Eastern Yiddish | ydd_Hebr | 781,452 | 0.12 | 1,223,481 | 0.23 |
Uyghur | uig_Arab | 785,444 | 0.12 | 1,256,999 | 0.24 |
Armenian | hye_Armn | 789,952 | 0.12 | 1,228,171 | 0.23 |
Hebrew | heb_Hebr | 793,144 | 0.12 | 1,604,365 | 0.3 |
Belarusian | bel_Cyrl | 806,588 | 0.12 | 1,261,197 | 0.24 |
Macedonian | mkd_Cyrl | 813,436 | 0.12 | 1,384,567 | 0.26 |
Welsh | cym_Latn | 821,036 | 0.12 | 1,321,455 | 0.25 |
Northern Uzbek | uzn_Latn | 835,560 | 0.12 | 1,273,404 | 0.24 |
Central Atlas Tamazight | tzm_Tfng | 843,508 | 0.12 | 1,223,481 | 0.23 |
Tamasheq (Tifinagh script) | taq_Tfng | 848,104 | 0.12 | 1,223,481 | 0.23 |
Magahi | mag_Deva | 851,360 | 0.13 | 1,223,481 | 0.23 |
Bhojpuri | bho_Deva | 854,848 | 0.13 | 1,223,481 | 0.23 |
Awadhi | awa_Deva | 857,096 | 0.13 | 1,224,037 | 0.23 |
Chhattisgarhi | hne_Deva | 859,332 | 0.13 | 1,223,481 | 0.23 |
Kyrgyz | kir_Cyrl | 860,700 | 0.13 | 1,250,163 | 0.23 |
Maithili | mai_Deva | 863,476 | 0.13 | 1,223,481 | 0.23 |
Assamese | asm_Beng | 865,904 | 0.13 | 1,223,481 | 0.23 |
Kashmiri (Devanagari script) | kas_Deva | 867,232 | 0.13 | 1,223,481 | 0.23 |
Sanskrit | san_Deva | 879,236 | 0.13 | 1,223,481 | 0.23 |
Lao | lao_Laoo | 888,240 | 0.13 | 1,223,481 | 0.23 |
Odia | ory_Orya | 890,508 | 0.13 | 1,223,481 | 0.23 |
Santali | sat_Olck | 902,300 | 0.13 | 1,223,481 | 0.23 |
Kannada | kan_Knda | 909,260 | 0.13 | 1,223,481 | 0.23 |
Meitei (Bengali script) | mni_Beng | 917,984 | 0.14 | 1,223,481 | 0.23 |
Georgian | kat_Geor | 928,712 | 0.14 | 1,226,729 | 0.23 |
Kamba | kam_Latn | 936,468 | 0.14 | 2,136,615 | 0.4 |
Tigrinya | tir_Ethi | 949,608 | 0.14 | 1,276,536 | 0.24 |
Swati | ssw_Latn | 950,564 | 0.14 | 2,195,002 | 0.41 |
Malayalam | mal_Mlym | 953,984 | 0.14 | 1,225,083 | 0.23 |
Nigerian Fulfulde | fuv_Latn | 956,328 | 0.14 | 2,126,652 | 0.4 |
Umbundu | umb_Latn | 974,104 | 0.14 | 2,264,553 | 0.43 |
Ganda | lug_Latn | 975,780 | 0.14 | 2,273,481 | 0.43 |
Northern Sotho | nso_Latn | 978,484 | 0.14 | 2,250,971 | 0.42 |
Khmer | khm_Khmr | 984,756 | 0.14 | 1,227,825 | 0.23 |
Luo | luo_Latn | 993,068 | 0.15 | 2,249,242 | 0.42 |
Standard Tibetan | bod_Tibt | 993,732 | 0.15 | 1,223,481 | 0.23 |
Tswana | tsn_Latn | 1,009,328 | 0.15 | 2,323,481 | 0.44 |
Kinyarwanda | kin_Latn | 1,010,752 | 0.15 | 2,273,481 | 0.43 |
Sinhala | sin_Sinh | 1,012,012 | 0.15 | 1,256,582 | 0.24 |
Xhosa | xho_Latn | 1,019,804 | 0.15 | 2,323,481 | 0.44 |
Shona | sna_Latn | 1,026,320 | 0.15 | 2,273,481 | 0.43 |
Esperanto | epo_Latn | 1,029,444 | 0.15 | 2,612,083 | 0.49 |
Tsonga | tso_Latn | 1,031,856 | 0.15 | 2,323,481 | 0.44 |
Dzongkha | dzo_Tibt | 1,033,552 | 0.15 | 1,223,481 | 0.23 |
Zulu | zul_Latn | 1,039,296 | 0.15 | 2,323,481 | 0.44 |
Serbian | srp_Cyrl | 1,040,024 | 0.15 | 1,362,598 | 0.26 |
Nyanja | nya_Latn | 1,061,780 | 0.16 | 2,323,481 | 0.44 |
Shan | shn_Mymr | 1,074,940 | 0.16 | 1,223,481 | 0.23 |
Igbo | ibo_Latn | 1,095,300 | 0.16 | 2,282,301 | 0.43 |
Hausa | hau_Latn | 1,112,272 | 0.16 | 2,335,738 | 0.44 |
West Central Oromo | gaz_Latn | 1,115,600 | 0.16 | 2,343,260 | 0.44 |
Nepali | npi_Deva | 1,144,676 | 0.17 | 1,281,430 | 0.24 |
Yoruba | yor_Latn | 1,164,540 | 0.17 | 2,334,801 | 0.44 |
Southern Pashto | pbt_Arab | 1,170,840 | 0.17 | 1,365,533 | 0.26 |
Somali | som_Latn | 1,198,320 | 0.18 | 2,482,437 | 0.47 |
Burmese | mya_Mymr | 1,228,196 | 0.18 | 1,279,882 | 0.24 |
Amharic | amh_Ethi | 1,261,128 | 0.19 | 1,980,215 | 0.37 |
Eastern Panjabi | pan_Guru | 1,305,636 | 0.19 | 1,307,897 | 0.25 |
Gujarati | guj_Gujr | 1,331,780 | 0.2 | 1,317,314 | 0.25 |
Marathi | mar_Deva | 1,494,024 | 0.22 | 1,443,950 | 0.27 |
Bengali | ben_Beng | 1,650,272 | 0.24 | 1,411,514 | 0.27 |
Chinese (Traditional) | zho_Hant | 1,778,736 | 0.26 | 1,956,189 | 0.37 |
Tamil | tam_Taml | 1,833,328 | 0.27 | 1,394,473 | 0.26 |
Swahili | swh_Latn | 1,970,784 | 0.29 | 4,185,608 | 0.79 |
Telugu | tel_Telu | 2,224,480 | 0.33 | 1,573,325 | 0.3 |
Ukrainian | ukr_Cyrl | 2,227,616 | 0.33 | 2,216,119 | 0.42 |
Western Persian | pes_Arab | 2,389,340 | 0.35 | 1,811,121 | 0.34 |
Turkish | tur_Latn | 3,106,600 | 0.46 | 4,146,153 | 0.78 |
Urdu | urd_Arab | 3,553,960 | 0.52 | 3,513,218 | 0.66 |
Korean | kor_Hang | 4,642,468 | 0.68 | 3,415,920 | 0.64 |
Python | python | 4,728,504 | 0.7 | 3,142,962 | 0.59 |
Japanese | jpn_Jpan | 5,079,788 | 0.75 | 4,193,570 | 0.79 |
Thai | tha_Thai | 6,860,704 | 1.01 | 4,666,299 | 0.88 |
Chinese (Simplified) | zho_Hans | 8,063,684 | 1.19 | 7,355,509 | 1.38 |
Vietnamese | vie_Latn | 8,398,824 | 1.24 | 6,194,925 | 1.16 |
Indonesian | ind_Latn | 9,380,144 | 1.38 | 5,301,812 | 1.0 |
Hindi | hin_Deva | 9,914,328 | 1.46 | 5,612,176 | 1.05 |
Croatian | hrv_Latn | 10,028,028 | 1.48 | 5,583,975 | 1.05 |
Modern Standard Arabic | arb_Arab | 11,051,064 | 1.63 | 7,232,551 | 1.36 |
Romanian | ron_Latn | 11,441,636 | 1.68 | 5,594,927 | 1.05 |
Maltese | mlt_Latn | 11,614,488 | 1.71 | 5,513,885 | 1.04 |
Slovenian | slv_Latn | 12,014,912 | 1.77 | 5,533,689 | 1.04 |
Estonian | est_Latn | 12,126,212 | 1.79 | 5,584,057 | 1.05 |
Lithuanian | lit_Latn | 12,253,976 | 1.8 | 5,603,047 | 1.05 |
Slovak | slk_Latn | 12,286,300 | 1.81 | 5,513,481 | 1.04 |
Standard Latvian | lvs_Latn | 12,298,584 | 1.81 | 5,517,287 | 1.04 |
Polish | pol_Latn | 12,409,684 | 1.83 | 5,868,631 | 1.1 |
Hungarian | hun_Latn | 12,607,420 | 1.86 | 6,086,621 | 1.14 |
Russian | rus_Cyrl | 13,110,908 | 1.93 | 8,798,927 | 1.65 |
Czech | ces_Latn | 14,316,052 | 2.11 | 6,418,462 | 1.21 |
Bulgarian | bul_Cyrl | 14,615,468 | 2.15 | 7,265,885 | 1.37 |
Swedish | swe_Latn | 14,646,656 | 2.16 | 5,634,363 | 1.06 |
Finnish | fin_Latn | 15,011,464 | 2.21 | 6,077,501 | 1.14 |
Danish | dan_Latn | 16,136,612 | 2.38 | 5,831,109 | 1.1 |
Dutch | nld_Latn | 22,387,020 | 3.3 | 8,992,864 | 1.69 |
Greek | ell_Grek | 23,144,296 | 3.41 | 7,224,001 | 1.36 |
Italian | ita_Latn | 23,952,824 | 3.53 | 9,967,738 | 1.87 |
Portuguese | por_Latn | 27,297,252 | 4.02 | 11,242,808 | 2.11 |
German | deu_Latn | 27,909,808 | 4.11 | 15,806,969 | 2.97 |
French | fra_Latn | 28,428,608 | 4.18 | 16,365,984 | 3.08 |
Spanish | spa_Latn | 30,969,580 | 4.56 | 16,315,928 | 3.07 |
English | eng_Latn | 69,530,384 | 10.24 | 53,015,690 | 9.96 |
Total | - | 679,318,704 | 100 | 532,107,156 | 100 |
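The percentage columns can be recomputed from the raw counts. The sketch below parses two rows copied from the table and derives the English shares; `parse_row` is a hypothetical helper, and the conversion to gigabytes assumes 1 gigabyte = 10^6 kilobytes, which is a guess about the unit convention.

```python
from typing import Tuple


def parse_row(row: str) -> Tuple[str, str, int, int]:
    """Split one pipe-delimited table row into (language, code, kilobytes, samples)."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    language, code, kilobytes, _, samples, _ = cells
    # Counts use thousands separators, e.g. "69,530,384"
    return language, code, int(kilobytes.replace(",", "")), int(samples.replace(",", ""))


english = parse_row("English | eng_Latn | 69,530,384 | 10.24 | 53,015,690 | 9.96 |")
total = parse_row("Total | - | 679,318,704 | 100 | 532,107,156 | 100 |")

kb_share = 100 * english[2] / total[2]      # share of kilobytes, in percent
sample_share = 100 * english[3] / total[3]  # share of samples, in percent
total_gb = total[2] / 1e6                   # assumes 1 GB = 10^6 KB

print(f"{kb_share:.2f} {sample_share:.2f} {total_gb:.0f}")
```

Under that unit assumption the total comes out at roughly 679 gigabytes, in line with the rounded size quoted earlier in the card.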
The dataset collection is released under Apache 2.0. Note that individual datasets may have different licenses.
    @article{muennighoff2022crosslingual,
      title={Crosslingual generalization through multitask finetuning},
      author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
      journal={arXiv preprint arXiv:2211.01786},
      year={2022}
    }
Thanks to the contributors of promptsource for adding many prompts used in this dataset. Thanks to the Aya team @ C4AI.