Dataset:

bigscience/xP3mt

Tasks:

task_categories:other

Languages:

Multilinguality:

multilingual

Size:

100M<n<1B

Annotation creators:

expert-generated, crowdsourced

Preprint:

arxiv:2211.01786

License:

apache-2.0

Dataset Card for xP3

Dataset Summary

xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts and datasets across 46 languages and 16 NLP tasks. It is used to train BLOOMZ and mT0, multilingual language models capable of following human instructions in dozens of languages zero-shot.

- Creation: The dataset can be recreated using the instructions available here. We provide this version to save processing time and ease reproducibility.
- Languages: 46 (can be extended by recreating with more splits)
- xP3 Dataset Family:

Name	Explanation	Example models
xP3x	Mixture of 17 tasks in 277 languages with English prompts	WIP - Join us at Project Aya @ C4AI to help!
xP3	Mixture of 13 training tasks in 46 languages with English prompts	bloomz & mt0-xxl
xP3mt	Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English)	bloomz-mt & mt0-xxl-mt
xP3all	xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts
xP3megds	Megatron-DeepSpeed processed version of xP3	bloomz
P3	Re-preprocessed version of the English-only P3 with 8 training tasks	bloomz-p3 & mt0-xxl-p3

Dataset Structure

Data Instances

An example of "train" looks as follows:

{
"inputs": "Oración 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nOración 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nPregunta: ¿La oración 1 parafrasea la oración 2? ¿Si o no?",
"targets": "Sí"
}

Data Fields

The data fields are the same among all splits:

- inputs: the natural language input fed to the model
- targets: the natural language target that the model has to generate
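
Since every row carries only these two string fields, a row can be checked with the standard library alone. The following is a minimal sketch, assuming rows are stored as one JSON object per line (as the merged_{lang}.jsonl file names suggest). The helper `parse_record` and the shortened sample text are hypothetical and not part of the dataset tooling.

```python
import json

REQUIRED_FIELDS = ("inputs", "targets")


def parse_record(line):
    """Parse one JSON Lines row and check that both fields are strings."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each row must be a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field!r}")
    return record


# Shortened, made-up row in the same shape as the data instance shown earlier.
sample_line = json.dumps(
    {
        "inputs": "Oración 1: ...\nOración 2: ...\nPregunta: ¿La oración 1 parafrasea la oración 2? ¿Si o no?",
        "targets": "Sí",
    },
    ensure_ascii=False,
)

record = parse_record(sample_line)
print(record["targets"])
```

Rejecting rows early keeps malformed lines from silently reaching a training loop. Whether to skip or abort on a bad row is a design choice left to the caller.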

Data Splits

The table below summarizes sizes per language (computed from the merged_{lang}.jsonl files). Because languages such as tw contain only single-sentence translation samples from Flores, their share of bytes is significantly lower than their share of samples. We machine-translated prompts for monolingual datasets, so languages that appear only in crosslingual datasets (e.g. Translation) do not have non-English prompts. Languages without non-English prompts are equivalent to xP3.

Language	Kilobytes	% of kilobytes	Samples	% of samples	Non-English prompts
tw	106288	0.11	265071	0.33
bm	107056	0.11	265180	0.33
ak	108096	0.11	265071	0.33
ca	110608	0.11	271191	0.34
eu	113008	0.12	281199	0.35
fon	113072	0.12	265063	0.33
st	114080	0.12	265063	0.33
ki	115040	0.12	265180	0.33
tum	116032	0.12	265063	0.33
wo	122560	0.13	365063	0.46
ln	126304	0.13	365060	0.46
as	156256	0.16	265063	0.33
or	161472	0.17	265063	0.33
kn	165456	0.17	265063	0.33
ml	175040	0.18	265864	0.33
rn	192992	0.2	318189	0.4
nso	229712	0.24	915051	1.14
tn	235536	0.24	915054	1.14
lg	235936	0.24	915021	1.14
rw	249360	0.26	915043	1.14
ts	250256	0.26	915044	1.14
sn	252496	0.26	865056	1.08
xh	254672	0.26	915058	1.14
zu	263712	0.27	915061	1.14
ny	272128	0.28	915063	1.14
ig	325440	0.33	950097	1.19	✅
yo	339664	0.35	913021	1.14	✅
ne	398144	0.41	315754	0.39	✅
pa	529632	0.55	339210	0.42	✅
sw	561392	0.58	1114439	1.39	✅
gu	566576	0.58	347499	0.43	✅
mr	674000	0.69	417269	0.52	✅
bn	854864	0.88	428725	0.54	✅
ta	943440	0.97	410633	0.51	✅
te	1384016	1.42	573354	0.72	✅
ur	1944416	2.0	855756	1.07	✅
vi	3113184	3.2	1667306	2.08	✅
code	4330752	4.46	2707724	3.38
hi	4469712	4.6	1543441	1.93	✅
id	4538768	4.67	2582272	3.22	✅
zh	4604112	4.74	3571636	4.46	✅
ar	4703968	4.84	2148970	2.68	✅
fr	5558912	5.72	5055942	6.31	✅
pt	6130016	6.31	3562772	4.45	✅
es	7579424	7.8	5151349	6.43	✅
en	39252528	40.4	32740750	40.87
total	97150128	100.0	80100816	100.0	✅
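
The percentage columns follow directly from the per-language counts and the "total" row. The sketch below recomputes two rows from the table and shows one way such counts could be gathered from a directory of merged_{lang}.jsonl files. The helpers `scan` and `percent` are hypothetical, and how the original kilobyte figures were measured is not stated in this card, so `scan` (file size divided by 1024, one non-empty line per sample) is an assumption.

```python
import tempfile
from pathlib import Path


def scan(data_dir):
    """Return {lang: (kilobytes, samples)} for each merged_{lang}.jsonl file."""
    stats = {}
    for path in sorted(Path(data_dir).glob("merged_*.jsonl")):
        lang = path.stem[len("merged_"):]
        with path.open("rb") as handle:
            # Assumption: one non-empty line is one sample.
            samples = sum(1 for line in handle if line.strip())
        stats[lang] = (path.stat().st_size / 1024, samples)
    return stats


def percent(part, total):
    """Share of `part` in `total`, in percent, rounded to two decimals."""
    return round(100 * part / total, 2)


# Recompute two rows of the table from its "total" row.
TOTAL_KB, TOTAL_SAMPLES = 97150128, 80100816
tw_share = (percent(106288, TOTAL_KB), percent(265071, TOTAL_SAMPLES))
en_share = (percent(39252528, TOTAL_KB), percent(32740750, TOTAL_SAMPLES))
print(tw_share, en_share)

# Tiny made-up files, only to show `scan` running end to end.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "merged_aa.jsonl").write_text(
        '{"inputs": "a", "targets": "b"}\n' * 3, encoding="utf-8"
    )
    Path(tmp, "merged_bb.jsonl").write_text(
        '{"inputs": "c", "targets": "d"}\n', encoding="utf-8"
    )
    demo_stats = scan(tmp)
print({lang: n for lang, (_, n) in demo_stats.items()})
```

With the table's own numbers, tw comes out at 0.11% of kilobytes but 0.33% of samples, which matches its row and illustrates how short single-sentence samples lower the byte share.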

Dataset Creation

Source Data

Training datasets

Code Miscellaneous
Closed-book QA
Extractive QA
- Adversarial QA
- CMRC2018
- DRCD
- DuoRC
- MLQA
- Quoref
- ReCoRD
- ROPES
- SQuAD v2
- xQuAD
- TyDI QA
  - Primary
  - Goldp
Multiple-Choice QA
- ARC
- C3
- CoS-E
- Cosmos
- DREAM
- MultiRC
- OpenBookQA
- PiQA
- QUAIL
- QuaRel
- QuaRTz
- QASC
- RACE
- SciQ
- Social IQA
- Wiki Hop
- WiQA
Paraphrase Identification
- MRPC
- PAWS
- PAWS-X
- QQP
Program Synthesis
Structure-to-text
- Common Gen
- Wiki Bio
Sentiment
- Amazon
- App Reviews
- IMDB
- Rotten Tomatoes
- Yelp
Simplification
- BiSECT
Summarization
- CNN Daily Mail
- Gigaword
- MultiNews
- SamSum
- Wiki-Lingua
- XLSum
- XSum
Topic Classification
- AG News
- DBPedia
- TNEWS
- TREC
- CSL
Translation
- Flores-200
- Tatoeba
Word Sense Disambiguation
- WiC
- XL-WiC

Evaluation datasets (included in xP3all except for NLI & HumanEval)

Natural Language Inference (NLI)
- ANLI
- CB
- RTE
- XNLI
Coreference Resolution
- Winogrande
- XWinograd
Program Synthesis
- HumanEval
Sentence Completion
- COPA
- Story Cloze
- XCOPA
- XStoryCloze

Additional Information

Licensing Information

The dataset is released under Apache 2.0.

Citation Information

@misc{muennighoff2022crosslingual,
      title={Crosslingual Generalization through Multitask Finetuning}, 
      author={Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel},
      year={2022},
      eprint={2211.01786},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to the contributors of promptsource for adding many prompts used in this dataset.

Author:

bigscience

Dataset size:

111.11 GB