Datasets:

bigscience/xP3

Tasks:

task_categories:other

Languages:

Multilinguality:

multilingual

Size:

100M<n<1B

Annotations creators:

expert-generated crowdsourced

Preprint:

arxiv:2211.01786

License:

apache-2.0


Dataset Card for xP3

Dataset Summary

xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts and datasets across 46 languages and 16 NLP tasks. It is used to train BLOOMZ and mT0, multilingual language models capable of following human instructions in dozens of languages zero-shot.

Creation: The dataset can be recreated using the instructions available here. We provide this version to save processing time and ease reproducibility.
Languages: 46 (can be extended by recreating with more splits)
xP3 Dataset Family:

Name	Explanation	Example models
xP3x	Mixture of 17 tasks in 277 languages with English prompts	WIP - Join us at Project Aya @ C4AI to help!
xP3	Mixture of 13 training tasks in 46 languages with English prompts	bloomz & mt0-xxl
xP3mt	Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English)	bloomz-mt & mt0-xxl-mt
xP3all	xP3 plus the evaluation datasets, which add 3 tasks for a total of 16 tasks in 46 languages with English prompts	
xP3megds	Megatron-DeepSpeed processed version of xP3	bloomz
P3	Re-preprocessed version of the English-only P3 with 8 training tasks	bloomz-p3 & mt0-xxl-p3

Dataset Structure

Data Instances

An example of "train" looks as follows:

{
"inputs": "Sentence 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nSentence 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No?",
"targets": "Yes" 
}
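The instance above pairs a prompted input with its target. As a rough illustration of how such a pair could be produced from a raw paraphrase example, here is a minimal sketch. The function name `build_paraphrase_example` and the exact template wording are assumptions modeled on the instance shown; the actual templates used to build xP3 may differ.

```python
def build_paraphrase_example(sentence1: str, sentence2: str, is_paraphrase: bool) -> dict:
    """Hypothetical template that mirrors the instance shown above."""
    inputs = (
        f"Sentence 1: {sentence1}\n"
        f"Sentence 2: {sentence2}\n"
        "Question: Can we rewrite Sentence 1 to Sentence 2? Yes or No?"
    )
    # The target is the natural language answer the model should generate.
    targets = "Yes" if is_paraphrase else "No"
    return {"inputs": inputs, "targets": targets}


example = build_paraphrase_example(
    "Fue académico en literatura metafísica, teología y ciencias clásicas.",
    "Fue académico en literatura metafísica, teología y ciencia clásica.",
    is_paraphrase=True,
)
print(example["targets"])
```

Note that the prompt text is in English while the sentences are in Spanish, which matches the "English prompts" description of xP3.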

Data Fields

The data fields are the same among all splits:

inputs: the natural language input fed to the model
targets: the natural language target that the model has to generate
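A minimal sketch of reading records with these two fields, assuming each line of a data file holds one JSON object (JSON Lines). The helper name `iter_examples` and the sample record are illustrative, not part of the dataset tooling.

```python
import json
from typing import Iterable, Iterator

REQUIRED_FIELDS = ("inputs", "targets")


def iter_examples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one {"inputs", "targets"} dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        yield {field: record[field] for field in REQUIRED_FIELDS}


# Illustrative sample lines; in practice these would come from an open file.
sample_lines = [
    '{"inputs": "Question: Is water wet? Yes or No?", "targets": "Yes"}',
    "",
]
examples = list(iter_examples(sample_lines))
print(len(examples), examples[0]["targets"])
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so large files are streamed rather than loaded into memory.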

Data Splits

The table below summarizes sizes per language (computed from the merged_{lang}.jsonl files). Because languages such as tw consist only of single-sentence translation samples from Flores, their share of bytes is significantly lower than their share of samples. Adding a new language is simple; you can take this script, which adds Russian, as an example.

Language	Kilobytes	% of kilobytes	Samples	% of samples
tw	106288	0.11	265071	0.34
bm	107056	0.11	265180	0.34
ak	108096	0.11	265071	0.34
eu	108112	0.11	269973	0.34
ca	110608	0.12	271191	0.34
fon	113072	0.12	265063	0.34
st	114080	0.12	265063	0.34
ki	115040	0.12	265180	0.34
tum	116032	0.12	265063	0.34
wo	122560	0.13	365063	0.46
ln	126304	0.13	365060	0.46
as	156256	0.16	265063	0.34
or	161472	0.17	265063	0.34
kn	165456	0.17	265063	0.34
ml	175040	0.18	265864	0.34
rn	192992	0.2	318189	0.4
nso	229712	0.24	915051	1.16
tn	235536	0.25	915054	1.16
lg	235936	0.25	915021	1.16
rw	249360	0.26	915043	1.16
ts	250256	0.26	915044	1.16
sn	252496	0.27	865056	1.1
xh	254672	0.27	915058	1.16
zu	263712	0.28	915061	1.16
ny	272128	0.29	915063	1.16
ig	325232	0.34	950097	1.2
yo	352784	0.37	918416	1.16
ne	393680	0.41	315754	0.4
pa	523248	0.55	339210	0.43
gu	560688	0.59	347499	0.44
sw	560896	0.59	1114455	1.41
mr	666240	0.7	417269	0.53
bn	832720	0.88	428843	0.54
ta	924496	0.97	410633	0.52
te	1332912	1.4	573364	0.73
ur	1918272	2.02	855756	1.08
vi	3101408	3.27	1667306	2.11
code	4330752	4.56	2707724	3.43
hi	4393696	4.63	1543441	1.96
zh	4589904	4.83	3560556	4.51
id	4606288	4.85	2627392	3.33
ar	4677264	4.93	2148955	2.72
fr	5546688	5.84	5055942	6.41
pt	6129584	6.46	3562772	4.52
es	7571808	7.98	5151349	6.53
en	37261104	39.25	31495184	39.93
total	94941936	100.0	78883588	100.0
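The percentage columns can be recomputed from the raw counts. The sketch below does this for two rows copied from the table (tw and en) and also derives kilobytes per sample, which shows why a language made of short translation samples has a lower byte share than sample share. The variable and function names are illustrative.

```python
# (kilobytes, samples) copied from the table above
ROWS = {
    "tw": (106288, 265071),
    "en": (37261104, 31495184),
}
TOTAL_KB = 94941936
TOTAL_SAMPLES = 78883588


def shares(lang: str) -> tuple:
    """Return (byte share in %, sample share in %), rounded to 2 decimals."""
    kilobytes, samples = ROWS[lang]
    return (
        round(100 * kilobytes / TOTAL_KB, 2),
        round(100 * samples / TOTAL_SAMPLES, 2),
    )


def kb_per_sample(lang: str) -> float:
    """Average sample size in kilobytes."""
    kilobytes, samples = ROWS[lang]
    return kilobytes / samples


for lang in ROWS:
    print(lang, shares(lang), round(kb_per_sample(lang), 2))
```

For these two rows, the average tw sample is roughly a third the size of the average en sample, which is consistent with tw containing only single-sentence translations.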

Dataset Creation

Source Data

Training datasets

Code Miscellaneous
Closed-book QA
Extractive QA
- Adversarial QA
- CMRC2018
- DRCD
- DuoRC
- MLQA
- Quoref
- ReCoRD
- ROPES
- SQuAD v2
- xQuAD
- TyDi QA
  - Primary
  - Goldp
Multiple-Choice QA
- ARC
- C3
- CoS-E
- Cosmos
- DREAM
- MultiRC
- OpenBookQA
- PiQA
- QUAIL
- QuaRel
- QuaRTz
- QASC
- RACE
- SciQ
- Social IQA
- Wiki Hop
- WiQA
Paraphrase Identification
- MRPC
- PAWS
- PAWS-X
- QQP
Program Synthesis
Structure-to-text
- Common Gen
- Wiki Bio
Sentiment
- Amazon
- App Reviews
- IMDB
- Rotten Tomatoes
- Yelp
Simplification
- BiSECT
Summarization
- CNN Daily Mail
- Gigaword
- MultiNews
- SamSum
- Wiki-Lingua
- XLSum
- XSum
Topic Classification
- AG News
- DBPedia
- TNEWS
- TREC
- CSL
Translation
- Flores-200
- Tatoeba
Word Sense Disambiguation
- WiC
- XL-WiC

Evaluation datasets (included in xP3all except for NLI datasets & HumanEval)

Natural Language Inference (NLI)
- ANLI
- CB
- RTE
- XNLI
Coreference Resolution
- Winogrande
- XWinograd
Program Synthesis
- HumanEval
Sentence Completion
- COPA
- Story Cloze
- XCOPA
- XStoryCloze

Additional Information

Licensing Information

The dataset is released under Apache 2.0.

Citation Information

@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}

Contributions

Thanks to the contributors of promptsource for adding many prompts used in this dataset.

Author:

bigscience

Dataset size:

108.63 GB