Dataset:

Muennighoff/xP3x

Tasks:

task_categories:other

Languages:

Multilinguality:

multilingual

Size:

100M<n<1B

Annotations creators:

expert-generated, crowdsourced

Preprint:

arxiv:2211.01786

License:

apache-2.0

Dataset Card for xP3x

Dataset Summary

xP3x (Crosslingual Public Pool of Prompts eXtended) is a collection of prompts & datasets across 277 languages & 16 NLP tasks. It contains all of xP3 + much more! It is used for training future contenders of mT0 & BLOOMZ at project Aya @ C4AI 🧡

Creation: The dataset can be recreated using instructions available here together with the file in this repository named xp3x_create.py. We provide this version to save processing time.
Languages: 277
xP3 Dataset Family:

Name	Explanation	Example models
xP3x	Mixture of 17 tasks in 277 languages with English prompts	WIP - Join us at Project Aya @ C4AI to help!
xP3	Mixture of 13 training tasks in 46 languages with English prompts	bloomz & mt0-xxl
xP3mt	Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English)	bloomz-mt & mt0-xxl-mt
xP3all	xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts
xP3megds	Megatron-DeepSpeed processed version of xP3	bloomz
P3	Re-preprocessed version of the English-only P3 with 8 training tasks	bloomz-p3 & mt0-xxl-p3

Dataset Structure

Data Instances

An example looks as follows:

{
  'inputs': '11月、遂にクロームはファイヤーフォックスを引き離し始めた。_はインターネットユーザーの評価が高まったのだ。\nReplace the _ in the above sentence with the correct option: \n- ファイヤーフォックス\n- クローム',
  'targets': 'クローム',
  'language': 'jpn_Jpan',
  'split': 'test',
  'template': 'Replace',
  'dataset': 'Muennighoff/xwinograd',
  'config': 'jp'
}

Data Fields

The data fields are the same among all splits:

inputs: The natural language input fed to the model.
targets: The natural language target that the model has to generate.
language: The language code. The codes are an extension of the FLORES-200 codes, where the first part is the language code and the second part is the script code.
template: The name of the prompt used.
dataset: The Hugging Face identifier of the dataset the data stems from.
config: The config of the Hugging Face dataset.
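
As a minimal sketch of how the two-part language code could be taken apart, the helper below splits a code such as jpn_Jpan into its language and script parts. The function name parse_language_code and the rule used to recognize a script part (four letters, capitalized) are assumptions made for illustration; codes without a script part, such as python or jupyter_notebook from the Data Splits table, are returned whole.

```python
from typing import NamedTuple, Optional


class LanguageCode(NamedTuple):
    language: str          # e.g. "jpn"
    script: Optional[str]  # e.g. "Jpan"; None when no script part is present


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as 'jpn_Jpan' into its language and script parts.

    Assumption of this sketch: a script part is exactly four letters with
    only the first one capitalized (e.g. 'Latn'). Codes that do not match,
    such as 'python' or 'jupyter_notebook', are returned whole.
    """
    head, sep, tail = code.rpartition("_")
    if sep and len(tail) == 4 and tail.isalpha() and tail.istitle():
        return LanguageCode(head, tail)
    return LanguageCode(code, None)


print(parse_language_code("jpn_Jpan"))
print(parse_language_code("jupyter_notebook"))
```

Grouping samples by the script part (for example, all codes ending in Arab) is one possible use of such a helper.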

Usage

The dataset is about 680 gigabytes in size and contains about 530 million samples. Depending on your needs, you may want to filter it first and then deduplicate it.

Loading by language:

# pip install -q datasets
from datasets import load_dataset
ds = load_dataset("Muennighoff/xP3x", "zho_Hans", streaming=True) # Use streaming to not download all at once
for x in ds["train"]:
    print(x)
    break

You can then filter down by the data fields to e.g. only get certain configs or datasets. As every dataset-config-template is its own jsonl file, you can also decide on the datasets, configs and templates you want and only download them. For example, to download all Japanese xwinograd samples, you could do:

# pip install -q datasets
# pip install --upgrade huggingface-hub
from datasets import load_dataset
from huggingface_hub import HfFileSystem, hf_hub_url

# List every Japanese jsonl file whose name contains "xwinograd"
fs = HfFileSystem()
fps = fs.glob("datasets/Muennighoff/xP3x/data/jpn_Jpan/*xwinograd*")
# Turn each repository path into a downloadable URL
resolved_paths = [fs.resolve_path(file) for file in fps]
data_files = [hf_hub_url(resolved_path.repo_id, resolved_path.path_in_repo, repo_type=resolved_path.repo_type) for resolved_path in resolved_paths]

# Load only those files, using 8 worker processes
ds = load_dataset("json", data_files=data_files, num_proc=8)["train"]
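
Filtering by data fields, as mentioned above, could look roughly like the following sketch. The helper make_filter is a hypothetical name, and the two records are hand-written stand-ins for streamed samples (the second one uses made-up field values), so the block runs without downloading anything.

```python
from typing import Any, Callable, Dict

Sample = Dict[str, Any]


def make_filter(**wanted: str) -> Callable[[Sample], bool]:
    """Return a predicate that keeps samples whose fields equal the given values."""
    def keep(sample: Sample) -> bool:
        return all(sample.get(field) == value for field, value in wanted.items())
    return keep


# Hand-written records standing in for streamed samples (illustrative only).
records = [
    {"inputs": "...", "targets": "クローム", "language": "jpn_Jpan",
     "template": "Replace", "dataset": "Muennighoff/xwinograd", "config": "jp"},
    {"inputs": "...", "targets": "...", "language": "jpn_Jpan",
     "template": "other-template", "dataset": "some/other-dataset", "config": "x"},
]

keep = make_filter(dataset="Muennighoff/xwinograd", template="Replace")
selected = [record for record in records if keep(record)]
print(len(selected))
```

When working with the streaming dataset from the first example, the same predicate could presumably be passed to the filter method of the train split, so that only matching samples are yielded during iteration.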

Data Splits

Language	Code	Kilobytes	% of kilobytes	Samples	% of samples
Emilian	egl_Latn	104	0.0	402	0.0
Swiss German	gsw_Latn	104	0.0	408	0.0
Novial	nov_Latn	116	0.0	432	0.0
Ainu (Latin script)	ain_Latn	120	0.0	410	0.0
Chamorro	cha_Latn	120	0.0	452	0.0
Gothic	got_Goth	120	0.0	402	0.0
Prussian	prg_Latn	120	0.0	424	0.0
Picard	pcd_Latn	140	0.0	530	0.0
Northern Frisian	frr_Latn	156	0.0	554	0.0
Uzbek (Latin script)	uzb_Latn	156	0.0	600	0.0
Ottoman Turkish (Latin script)	ota_Latn	188	0.0	632	0.0
Swahili (macrolanguage)	swa_Latn	212	0.0	772	0.0
Talossan	tzl_Latn	220	0.0	836	0.0
Kven Finnish	fkv_Latn	260	0.0	910	0.0
Zaza	zza_Latn	260	0.0	1,056	0.0
Frisian	fry_Latn	268	0.0	956	0.0
Piemontese	pms_Latn	276	0.0	998	0.0
Kalmyk	xal_Cyrl	288	0.0	976	0.0
Hunsrik	hrx_Latn	352	0.0	1,380	0.0
Romany	rom_Latn	364	0.0	1,410	0.0
Ancient Greek (to 1453)	grc_Grek	392	0.0	1,226	0.0
Tase Naga	nst_Latn	424	0.0	1,608	0.0
Albanian	sqi_Latn	596	0.0	2,216	0.0
Guadeloupean Creole French	gcf_Latn	608	0.0	2,326	0.0
Yakut	sah_Cyrl	608	0.0	1,986	0.0
Ho (Latin script)	hoc_Latn	632	0.0	2,634	0.0
Khasi	kha_Latn	676	0.0	2,664	0.0
Algerian Arabic	arq_Arab	688	0.0	2,278	0.0
Lower Sorbian	dsb_Latn	692	0.0	2,596	0.0
Chuvash	chv_Cyrl	716	0.0	2,446	0.0
Old Russian	orv_Cyrl	752	0.0	2,586	0.0
Pampanga	pam_Latn	784	0.0	2,984	0.0
Kurdish (Latin script)	kur_Latn	796	0.0	3,050	0.0
Ottoman Turkish	ota_Arab	832	0.0	2,772	0.0
Kotava	avk_Latn	864	0.0	3,118	0.0
Upper Sorbian	hsb_Latn	900	0.0	3,474	0.0
Buryat	bua_Cyrl	924	0.0	3,218	0.0
Swabian	swg_Latn	996	0.0	3,366	0.0
Coastal Kadazan	kzj_Latn	1,136	0.0	3,766	0.0
Chavacano	cbk_Latn	1,352	0.0	4,994	0.0
Quechua	que_Latn	1,704	0.0	5,312	0.0
Lingua Franca Nova (Cyrillic script)	lfn_Cyrl	1,740	0.0	5,458	0.0
Gronings	gos_Latn	1,864	0.0	7,462	0.0
Volapük	vol_Latn	1,948	0.0	7,712	0.0
Yue Chinese (Simplified)	yue_Hans	2,300	0.0	7,872	0.0
Mari (Russia)	chm_Cyrl	2,540	0.0	7,496	0.0
Kadazan Dusun	dtp_Latn	2,548	0.0	8,892	0.0
Breton	bre_Latn	3,048	0.0	11,868	0.0
Ladino	lad_Latn	3,224	0.0	11,916	0.0
Cornish	cor_Latn	3,492	0.0	13,880	0.0
Interlingue	ile_Latn	3,700	0.0	14,468	0.0
Wu Chinese	wuu_Hans	3,784	0.0	13,062	0.0
Japanese (Katakana)	jpn_Kana	4,208	0.0	13,942	0.0
Ido	ido_Latn	6,180	0.0	23,742	0.0
Yiddish	yid_Hebr	9,896	0.0	34,412	0.01
Klingon	tlh_Latn	11,716	0.0	46,010	0.01
Lingua Franca Nova	lfn_Latn	13,328	0.0	46,826	0.01
Lojban	jbo_Latn	17,468	0.0	66,694	0.01
Low German	nds_Latn	18,364	0.0	68,098	0.01
Interlingua (International Auxiliary Language Association)	ina_Latn	25,700	0.0	76,584	0.01
Java	java	25,904	0.0	13,551	0.0
Japanese (Kanji)	jpn_Hani	26,292	0.0	89,978	0.02
Norwegian	nor_Latn	26,724	0.0	93,116	0.02
Toki Pona	toki_Latn	26,808	0.0	97,170	0.02
Latin	lat_Latn	28,900	0.0	101,390	0.02
Serbo-Croatian	hbs_Latn	29,452	0.0	105,748	0.02
Nigerian Pidgin	pcm_Latn	145,872	0.02	88,992	0.02
Azerbaijani (South or North; Latin script)	aze_Latn	147,564	0.02	77,875	0.01
Serbian (Latin script)	srp_Latn	179,072	0.03	131,101	0.02
Japanese (Hiragana)	jpn_Hira	188,944	0.03	628,758	0.12
Berber (Latin script)	ber_Latn	201,464	0.03	693,602	0.13
Jupyter Notebook	jupyter_notebook	416,056	0.06	400,000	0.08
Yue Chinese	yue_Hant	613,352	0.09	1,227,429	0.23
Haitian Creole	hat_Latn	629,420	0.09	1,228,281	0.23
Mossi	mos_Latn	630,416	0.09	1,223,481	0.23
Pangasinan	pag_Latn	630,684	0.09	1,223,481	0.23
Twi	twi_Latn	631,172	0.09	1,223,481	0.23
Bosnian	bos_Latn	633,016	0.09	1,224,479	0.23
Ewe	ewe_Latn	633,292	0.09	1,223,481	0.23
Bambara	bam_Latn	634,520	0.09	1,223,481	0.23
Javanese	jav_Latn	635,248	0.09	1,224,003	0.23
Southwestern Dinka	dik_Latn	635,416	0.09	1,223,481	0.23
Kabuverdianu	kea_Latn	636,144	0.09	1,223,481	0.23
Dyula	dyu_Latn	636,464	0.09	1,223,481	0.23
Venetian	vec_Latn	637,412	0.09	1,223,481	0.23
Chokwe	cjk_Latn	637,532	0.09	1,223,481	0.23
Latgalian	ltg_Latn	637,612	0.09	1,223,481	0.23
Sundanese	sun_Latn	638,120	0.09	1,223,481	0.23
Asturian	ast_Latn	638,708	0.09	1,223,481	0.23
Akan	aka_Latn	639,648	0.09	1,223,481	0.23
Mizo	lus_Latn	639,680	0.09	1,223,481	0.23
Guarani	grn_Latn	641,540	0.09	1,225,647	0.23
Limburgish	lim_Latn	642,368	0.09	1,223,481	0.23
Faroese	fao_Latn	642,432	0.09	1,224,067	0.23
Buginese	bug_Latn	643,472	0.09	1,223,481	0.23
Sango	sag_Latn	643,596	0.09	1,223,481	0.23
Luba-Kasai	lua_Latn	643,640	0.09	1,223,481	0.23
Papiamento	pap_Latn	643,648	0.09	1,223,481	0.23
Silesian	szl_Latn	644,608	0.09	1,223,481	0.23
Sicilian	scn_Latn	645,636	0.1	1,223,481	0.23
Kimbundu	kmb_Latn	645,964	0.1	1,223,481	0.23
Basque	eus_Latn	646,084	0.1	1,246,877	0.23
Balinese	ban_Latn	646,408	0.1	1,223,481	0.23
Norwegian Nynorsk	nno_Latn	646,996	0.1	1,229,699	0.23
Central Aymara	ayr_Latn	647,236	0.1	1,223,481	0.23
Tamasheq (Latin script)	taq_Latn	648,656	0.1	1,223,481	0.23
Kikongo	kon_Latn	648,992	0.1	1,223,481	0.23
Friulian	fur_Latn	649,272	0.1	1,223,481	0.23
Ayacucho Quechua	quy_Latn	649,992	0.1	1,223,481	0.23
Maori	mri_Latn	650,336	0.1	1,224,211	0.23
Icelandic	isl_Latn	650,372	0.1	1,246,623	0.23
Galician	glg_Latn	652,088	0.1	1,233,291	0.23
Catalan	cat_Latn	652,116	0.1	1,241,381	0.23
Lombard	lmo_Latn	652,120	0.1	1,223,481	0.23
Banjar (Latin script)	bjn_Latn	652,372	0.1	1,223,481	0.23
Fijian	fij_Latn	652,796	0.1	1,223,481	0.23
Crimean Tatar	crh_Latn	653,920	0.1	1,223,895	0.23
Northern Kurdish	kmr_Latn	654,108	0.1	1,223,481	0.23
Ligurian	lij_Latn	654,432	0.1	1,223,481	0.23
Occitan	oci_Latn	655,676	0.1	1,227,945	0.23
Turkmen	tuk_Latn	658,672	0.1	1,241,205	0.23
Luxembourgish	ltz_Latn	658,768	0.1	1,225,339	0.23
Cebuano	ceb_Latn	659,124	0.1	1,226,039	0.23
Samoan	smo_Latn	659,704	0.1	1,223,481	0.23
Sardinian	srd_Latn	660,000	0.1	1,223,481	0.23
Bemba	bem_Latn	660,504	0.1	1,223,481	0.23
Minangkabau (Latin script)	min_Latn	660,672	0.1	1,223,481	0.23
Acehnese (Latin script)	ace_Latn	661,084	0.1	1,223,481	0.23
Ilocano	ilo_Latn	661,184	0.1	1,227,663	0.23
Irish	gle_Latn	661,660	0.1	1,227,357	0.23
Fon	fon_Latn	663,124	0.1	1,223,481	0.23
Waray	war_Latn	664,120	0.1	1,226,503	0.23
Norwegian Bokmål	nob_Latn	666,240	0.1	1,300,607	0.24
Tosk Albanian	als_Latn	666,692	0.1	1,223,481	0.23
Standard Malay	zsm_Latn	667,088	0.1	1,270,715	0.24
Southern Sotho	sot_Latn	667,728	0.1	1,223,481	0.23
Kabyle	kab_Latn	668,128	0.1	1,346,605	0.25
Jingpho	kac_Latn	669,464	0.1	1,223,481	0.23
Lingala	lin_Latn	670,428	0.1	1,323,481	0.25
Wolof	wol_Latn	670,568	0.1	1,373,481	0.26
Central Kanuri (Latin script)	knc_Latn	670,800	0.1	1,223,481	0.23
Kikuyu	kik_Latn	672,096	0.1	1,223,481	0.23
Tok Pisin	tpi_Latn	672,916	0.1	1,223,481	0.23
Nuer	nus_Latn	673,632	0.1	1,223,481	0.23
Tagalog	tgl_Latn	673,684	0.1	1,247,417	0.23
Tumbuka	tum_Latn	676,948	0.1	1,223,481	0.23
Plateau Malagasy	plt_Latn	677,852	0.1	1,223,481	0.23
Afrikaans	afr_Latn	679,164	0.1	1,337,091	0.25
North Azerbaijani	azj_Latn	679,820	0.1	1,223,481	0.23
Kabiyè	kbp_Latn	684,880	0.1	1,223,481	0.23
Modern Standard Arabic (Romanized)	arb_Latn	685,408	0.1	1,223,481	0.23
Scottish Gaelic	gla_Latn	708,620	0.1	1,243,627	0.23
Sindhi	snd_Arab	718,680	0.11	1,223,481	0.23
North Levantine Arabic	apc_Arab	720,048	0.11	1,223,481	0.23
Tunisian Arabic	aeb_Arab	720,360	0.11	1,223,481	0.23
South Levantine Arabic	ajp_Arab	720,488	0.11	1,223,481	0.23
Dari	prs_Arab	720,500	0.11	1,223,481	0.23
Moroccan Arabic	ary_Arab	722,904	0.11	1,223,481	0.23
Egyptian Arabic	arz_Arab	723,356	0.11	1,223,481	0.23
Najdi Arabic	ars_Arab	725,784	0.11	1,223,481	0.23
Acehnese (Arabic script)	ace_Arab	726,272	0.11	1,223,481	0.23
Mesopotamian Arabic	acm_Arab	728,472	0.11	1,223,481	0.23
Ta’izzi-Adeni Arabic	acq_Arab	734,780	0.11	1,223,481	0.23
South Azerbaijani	azb_Arab	735,728	0.11	1,223,481	0.23
Central Kanuri (Arabic script)	knc_Arab	746,936	0.11	1,223,481	0.23
Rundi	run_Latn	749,792	0.11	1,296,111	0.24
Banjar (Arabic script)	bjn_Arab	751,112	0.11	1,223,481	0.23
Central Kurdish	ckb_Arab	756,804	0.11	1,223,481	0.23
Bashkir	bak_Cyrl	758,816	0.11	1,223,481	0.23
Kashmiri (Arabic script)	kas_Arab	759,140	0.11	1,223,481	0.23
Tatar	tat_Cyrl	764,212	0.11	1,247,685	0.23
Minangkabau (Arabic script)	min_Arab	765,384	0.11	1,223,481	0.23
Kazakh	kaz_Cyrl	766,176	0.11	1,232,697	0.23
Halh Mongolian	khk_Cyrl	776,384	0.11	1,224,353	0.23
Tajik	tgk_Cyrl	780,452	0.11	1,223,481	0.23
Eastern Yiddish	ydd_Hebr	781,452	0.12	1,223,481	0.23
Uyghur	uig_Arab	785,444	0.12	1,256,999	0.24
Armenian	hye_Armn	789,952	0.12	1,228,171	0.23
Hebrew	heb_Hebr	793,144	0.12	1,604,365	0.3
Belarusian	bel_Cyrl	806,588	0.12	1,261,197	0.24
Macedonian	mkd_Cyrl	813,436	0.12	1,384,567	0.26
Welsh	cym_Latn	821,036	0.12	1,321,455	0.25
Northern Uzbek	uzn_Latn	835,560	0.12	1,273,404	0.24
Central Atlas Tamazight	tzm_Tfng	843,508	0.12	1,223,481	0.23
Tamasheq (Tifinagh script)	taq_Tfng	848,104	0.12	1,223,481	0.23
Magahi	mag_Deva	851,360	0.13	1,223,481	0.23
Bhojpuri	bho_Deva	854,848	0.13	1,223,481	0.23
Awadhi	awa_Deva	857,096	0.13	1,224,037	0.23
Chhattisgarhi	hne_Deva	859,332	0.13	1,223,481	0.23
Kyrgyz	kir_Cyrl	860,700	0.13	1,250,163	0.23
Maithili	mai_Deva	863,476	0.13	1,223,481	0.23
Assamese	asm_Beng	865,904	0.13	1,223,481	0.23
Kashmiri (Devanagari script)	kas_Deva	867,232	0.13	1,223,481	0.23
Sanskrit	san_Deva	879,236	0.13	1,223,481	0.23
Lao	lao_Laoo	888,240	0.13	1,223,481	0.23
Odia	ory_Orya	890,508	0.13	1,223,481	0.23
Santali	sat_Olck	902,300	0.13	1,223,481	0.23
Kannada	kan_Knda	909,260	0.13	1,223,481	0.23
Meitei (Bengali script)	mni_Beng	917,984	0.14	1,223,481	0.23
Georgian	kat_Geor	928,712	0.14	1,226,729	0.23
Kamba	kam_Latn	936,468	0.14	2,136,615	0.4
Tigrinya	tir_Ethi	949,608	0.14	1,276,536	0.24
Swati	ssw_Latn	950,564	0.14	2,195,002	0.41
Malayalam	mal_Mlym	953,984	0.14	1,225,083	0.23
Nigerian Fulfulde	fuv_Latn	956,328	0.14	2,126,652	0.4
Umbundu	umb_Latn	974,104	0.14	2,264,553	0.43
Ganda	lug_Latn	975,780	0.14	2,273,481	0.43
Northern Sotho	nso_Latn	978,484	0.14	2,250,971	0.42
Khmer	khm_Khmr	984,756	0.14	1,227,825	0.23
Luo	luo_Latn	993,068	0.15	2,249,242	0.42
Standard Tibetan	bod_Tibt	993,732	0.15	1,223,481	0.23
Tswana	tsn_Latn	1,009,328	0.15	2,323,481	0.44
Kinyarwanda	kin_Latn	1,010,752	0.15	2,273,481	0.43
Sinhala	sin_Sinh	1,012,012	0.15	1,256,582	0.24
Xhosa	xho_Latn	1,019,804	0.15	2,323,481	0.44
Shona	sna_Latn	1,026,320	0.15	2,273,481	0.43
Esperanto	epo_Latn	1,029,444	0.15	2,612,083	0.49
Tsonga	tso_Latn	1,031,856	0.15	2,323,481	0.44
Dzongkha	dzo_Tibt	1,033,552	0.15	1,223,481	0.23
Zulu	zul_Latn	1,039,296	0.15	2,323,481	0.44
Serbian	srp_Cyrl	1,040,024	0.15	1,362,598	0.26
Nyanja	nya_Latn	1,061,780	0.16	2,323,481	0.44
Shan	shn_Mymr	1,074,940	0.16	1,223,481	0.23
Igbo	ibo_Latn	1,095,300	0.16	2,282,301	0.43
Hausa	hau_Latn	1,112,272	0.16	2,335,738	0.44
West Central Oromo	gaz_Latn	1,115,600	0.16	2,343,260	0.44
Nepali	npi_Deva	1,144,676	0.17	1,281,430	0.24
Yoruba	yor_Latn	1,164,540	0.17	2,334,801	0.44
Southern Pashto	pbt_Arab	1,170,840	0.17	1,365,533	0.26
Somali	som_Latn	1,198,320	0.18	2,482,437	0.47
Burmese	mya_Mymr	1,228,196	0.18	1,279,882	0.24
Amharic	amh_Ethi	1,261,128	0.19	1,980,215	0.37
Eastern Panjabi	pan_Guru	1,305,636	0.19	1,307,897	0.25
Gujarati	guj_Gujr	1,331,780	0.2	1,317,314	0.25
Marathi	mar_Deva	1,494,024	0.22	1,443,950	0.27
Bengali	ben_Beng	1,650,272	0.24	1,411,514	0.27
Chinese (Traditional)	zho_Hant	1,778,736	0.26	1,956,189	0.37
Tamil	tam_Taml	1,833,328	0.27	1,394,473	0.26
Swahili	swh_Latn	1,970,784	0.29	4,185,608	0.79
Telugu	tel_Telu	2,224,480	0.33	1,573,325	0.3
Ukrainian	ukr_Cyrl	2,227,616	0.33	2,216,119	0.42
Western Persian	pes_Arab	2,389,340	0.35	1,811,121	0.34
Turkish	tur_Latn	3,106,600	0.46	4,146,153	0.78
Urdu	urd_Arab	3,553,960	0.52	3,513,218	0.66
Korean	kor_Hang	4,642,468	0.68	3,415,920	0.64
Python	python	4,728,504	0.7	3,142,962	0.59
Japanese	jpn_Jpan	5,079,788	0.75	4,193,570	0.79
Thai	tha_Thai	6,860,704	1.01	4,666,299	0.88
Chinese (Simplified)	zho_Hans	8,063,684	1.19	7,355,509	1.38
Vietnamese	vie_Latn	8,398,824	1.24	6,194,925	1.16
Indonesian	ind_Latn	9,380,144	1.38	5,301,812	1.0
Hindi	hin_Deva	9,914,328	1.46	5,612,176	1.05
Croatian	hrv_Latn	10,028,028	1.48	5,583,975	1.05
Modern Standard Arabic	arb_Arab	11,051,064	1.63	7,232,551	1.36
Romanian	ron_Latn	11,441,636	1.68	5,594,927	1.05
Maltese	mlt_Latn	11,614,488	1.71	5,513,885	1.04
Slovenian	slv_Latn	12,014,912	1.77	5,533,689	1.04
Estonian	est_Latn	12,126,212	1.79	5,584,057	1.05
Lithuanian	lit_Latn	12,253,976	1.8	5,603,047	1.05
Slovak	slk_Latn	12,286,300	1.81	5,513,481	1.04
Standard Latvian	lvs_Latn	12,298,584	1.81	5,517,287	1.04
Polish	pol_Latn	12,409,684	1.83	5,868,631	1.1
Hungarian	hun_Latn	12,607,420	1.86	6,086,621	1.14
Russian	rus_Cyrl	13,110,908	1.93	8,798,927	1.65
Czech	ces_Latn	14,316,052	2.11	6,418,462	1.21
Bulgarian	bul_Cyrl	14,615,468	2.15	7,265,885	1.37
Swedish	swe_Latn	14,646,656	2.16	5,634,363	1.06
Finnish	fin_Latn	15,011,464	2.21	6,077,501	1.14
Danish	dan_Latn	16,136,612	2.38	5,831,109	1.1
Dutch	nld_Latn	22,387,020	3.3	8,992,864	1.69
Greek	ell_Grek	23,144,296	3.41	7,224,001	1.36
Italian	ita_Latn	23,952,824	3.53	9,967,738	1.87
Portuguese	por_Latn	27,297,252	4.02	11,242,808	2.11
German	deu_Latn	27,909,808	4.11	15,806,969	2.97
French	fra_Latn	28,428,608	4.18	16,365,984	3.08
Spanish	spa_Latn	30,969,580	4.56	16,315,928	3.07
English	eng_Latn	69,530,384	10.24	53,015,690	9.96
Total	-	679,318,704	100	532,107,156	100
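
The percentage columns follow from dividing each row by the Total row. As a short worked example, the sketch below recomputes the two shares for English from the figures in the table; the function name share is an assumption made for illustration.

```python
def share(part: int, total: int) -> float:
    """Percentage of the total, rounded to two decimals as in the table."""
    return round(100 * part / total, 2)


TOTAL_KB = 679_318_704        # Total row, Kilobytes column
TOTAL_SAMPLES = 532_107_156   # Total row, Samples column

# English row: 69,530,384 kilobytes and 53,015,690 samples.
english_size_share = share(69_530_384, TOTAL_KB)
english_sample_share = share(53_015_690, TOTAL_SAMPLES)

print(english_size_share)    # 10.24
print(english_sample_share)  # 9.96
```

The same calculation can help estimate how much of the download a chosen subset of languages represents before fetching it.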

Language specifics

Japanese: Every sample in jpn_Hira, jpn_Kana, and jpn_Hani is guaranteed to contain Hiragana, Katakana, or Kanji, respectively. Samples may still contain the other scripts as well: all samples in jpn_Kana contain Katakana, but some also contain Hiragana or Kanji.
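
To check which scripts a given sample actually contains, a simple test on Unicode code points can be used. The sketch below is not the procedure used to build the dataset; it only covers the basic Hiragana, Katakana, and CJK Unified Ideographs blocks, and the function name scripts_in is an assumption.

```python
from typing import Set


def scripts_in(text: str) -> Set[str]:
    """Return which of Hiragana, Katakana, and Kanji occur in the text.

    Only the basic Unicode blocks are checked (an assumption of this sketch):
    Hiragana U+3040-U+309F, Katakana U+30A0-U+30FF,
    CJK Unified Ideographs U+4E00-U+9FFF.
    """
    found: Set[str] = set()
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x309F:
            found.add("Hira")
        elif 0x30A0 <= cp <= 0x30FF:
            found.add("Kana")
        elif 0x4E00 <= cp <= 0x9FFF:
            found.add("Hani")
    return found


print(scripts_in("クローム"))              # Katakana only
print(scripts_in("11月、遂にクロームは"))  # all three scripts
```

A stricter subset, such as Katakana-only samples, could be obtained by keeping a sample only when the returned set equals {"Kana"}.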

Dataset Creation

Source Data

Training datasets

Code Miscellaneous
Closed-book QA
Extractive QA
- Adversarial QA
- CMRC2018
- DRCD
- DuoRC
- MLQA
- Quoref
- ReCoRD
- ROPES
- SQuAD v2
- xQuAD
- TyDi QA
  - Primary
  - GoldP
Multiple-Choice QA
- ARC
- C3
- CoS-E
- Cosmos
- DREAM
- MultiRC
- OpenBookQA
- PiQA
- QUAIL
- QuaRel
- QuaRTz
- QASC
- RACE
- SciQ
- Social IQA
- Wiki Hop
- WiQA
Paraphrase Identification
- MRPC
- PAWS
- PAWS-X
- QQP
Program Synthesis
Structure-to-text
- Common Gen
- Wiki Bio
Sentiment
- Amazon
- App Reviews
- IMDB
- Rotten Tomatoes
- Yelp
Simplification
- BiSECT
Summarization
- CNN Daily Mail
- Gigaword
- MultiNews
- SamSum
- Wiki-Lingua
- XLSum
- XSum
Topic Classification
- AG News
- DBPedia
- TNEWS
- TREC
- CSL
Translation
Word Sense Disambiguation
- WiC
- XL-WiC
Natural Language Inference (NLI)
- ANLI
- CB
- RTE
- XNLI
Coreference Resolution
- Winogrande
- XWinograd
Sentence Completion
- COPA
- Story Cloze
- XCOPA
- XStoryCloze

Dataset specifics

Flores-200: There are three prompts for Flores: continuation, question, and command. They represent three commonly used prompting styles, i.e. making a prompt seem like a natural continuation, turning it into a question, or commanding the model to do something.
tatoeba_mt: Contains duplicates. For example, it has data that is classified as both jpn_Kana and jpn_Jpan, so you may want to deduplicate.
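
One possible way to deduplicate is to keep the first occurrence of each (inputs, targets) pair. The sketch below is an illustration under that assumption rather than a recommended pipeline; the function name deduplicate and the sample records are hypothetical. The language field is left out of the key on purpose, so that the same text filed under two codes counts as one sample.

```python
import hashlib
from typing import Any, Dict, Iterable, Iterator


def deduplicate(samples: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield each (inputs, targets) pair once, keeping the first occurrence."""
    seen = set()
    for sample in samples:
        # Hash the pair so that only a 32-byte digest is stored per unique sample.
        key = hashlib.sha256(
            (sample["inputs"] + "\x00" + sample["targets"]).encode("utf-8")
        ).digest()
        if key not in seen:
            seen.add(key)
            yield sample


# Hand-written records (illustrative only): the first two differ only in language.
records = [
    {"inputs": "Translate: ありがとう", "targets": "Thank you", "language": "jpn_Jpan"},
    {"inputs": "Translate: ありがとう", "targets": "Thank you", "language": "jpn_Hira"},
    {"inputs": "Translate: クローム", "targets": "Chrome", "language": "jpn_Kana"},
]

unique = list(deduplicate(records))
print(len(unique))
```

The set of digests grows with the number of unique samples, so this approach is better suited to a filtered subset than to the full collection.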

Additional Information

Licensing Information

The dataset collection is released under Apache 2.0. Note that individual datasets may have different licenses.

Citation Information

@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}

Contributions

Thanks to the contributors of promptsource for adding many prompts used in this dataset. Thanks to the Aya team @ C4AI 🧡

Author:

Muennighoff

Dataset size:

4.34 GB