Dataset:

bigscience/xP3

Dataset Card for xP3

Dataset Summary

xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 languages & 16 NLP tasks. It is used to train BLOOMZ and mT0, multilingual language models capable of following human instructions in dozens of languages zero-shot.

  • Creation: The dataset can be recreated using the instructions available here. We provide this version to save processing time and ease reproducibility.
  • Languages: 46 (can be extended by recreating with more splits)
  • xP3 Dataset Family:
| Name | Explanation | Example models |
|---|---|---|
| xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help! |
| xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl |
| xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt |
| xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
| xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz |
| P3 | Repreprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3 |

Dataset Structure

Data Instances

An example of "train" looks as follows:

{
"inputs": "Sentence 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nSentence 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No?",
"targets": "Yes" 
}

Data Fields

The data fields are the same among all splits:

  • inputs : the natural language input fed to the model
  • targets : the natural language target that the model has to generate
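As a minimal sketch of how records with these two fields might be read, the snippet below parses JSON Lines text and checks that every record carries string-valued `inputs` and `targets`. The function name `parse_records` and the shortened sample row are illustrative assumptions, not part of the xP3 tooling.

```python
import json
from typing import Iterable, Iterator

REQUIRED_FIELDS = ("inputs", "targets")


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines row, validating both fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for field in REQUIRED_FIELDS:
            if not isinstance(record.get(field), str):
                raise ValueError(
                    f"line {line_number}: missing or non-string {field!r}"
                )
        yield record


# Hypothetical row, shortened from the "train" instance shown earlier.
sample = [
    '{"inputs": "Question: Can we rewrite Sentence 1 to Sentence 2? Yes or No?", "targets": "Yes"}'
]
records = list(parse_records(sample))
print(records[0]["targets"])
```

The same generator could be pointed at an open file handle, since iterating over a text file yields one line at a time.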

Data Splits

The table below summarizes sizes per language (computed from the merged_{lang}.jsonl files). Because languages like tw consist only of single-sentence translation samples from Flores, their byte percentage is significantly lower than their sample percentage. Adding a new language is simple; you can take this script adding Russian as an example.

| Language | Kilobytes | % of kilobytes | Samples | % of samples |
|---|---|---|---|---|
| tw | 106288 | 0.11 | 265071 | 0.34 |
| bm | 107056 | 0.11 | 265180 | 0.34 |
| ak | 108096 | 0.11 | 265071 | 0.34 |
| eu | 108112 | 0.11 | 269973 | 0.34 |
| ca | 110608 | 0.12 | 271191 | 0.34 |
| fon | 113072 | 0.12 | 265063 | 0.34 |
| st | 114080 | 0.12 | 265063 | 0.34 |
| ki | 115040 | 0.12 | 265180 | 0.34 |
| tum | 116032 | 0.12 | 265063 | 0.34 |
| wo | 122560 | 0.13 | 365063 | 0.46 |
| ln | 126304 | 0.13 | 365060 | 0.46 |
| as | 156256 | 0.16 | 265063 | 0.34 |
| or | 161472 | 0.17 | 265063 | 0.34 |
| kn | 165456 | 0.17 | 265063 | 0.34 |
| ml | 175040 | 0.18 | 265864 | 0.34 |
| rn | 192992 | 0.2 | 318189 | 0.4 |
| nso | 229712 | 0.24 | 915051 | 1.16 |
| tn | 235536 | 0.25 | 915054 | 1.16 |
| lg | 235936 | 0.25 | 915021 | 1.16 |
| rw | 249360 | 0.26 | 915043 | 1.16 |
| ts | 250256 | 0.26 | 915044 | 1.16 |
| sn | 252496 | 0.27 | 865056 | 1.1 |
| xh | 254672 | 0.27 | 915058 | 1.16 |
| zu | 263712 | 0.28 | 915061 | 1.16 |
| ny | 272128 | 0.29 | 915063 | 1.16 |
| ig | 325232 | 0.34 | 950097 | 1.2 |
| yo | 352784 | 0.37 | 918416 | 1.16 |
| ne | 393680 | 0.41 | 315754 | 0.4 |
| pa | 523248 | 0.55 | 339210 | 0.43 |
| gu | 560688 | 0.59 | 347499 | 0.44 |
| sw | 560896 | 0.59 | 1114455 | 1.41 |
| mr | 666240 | 0.7 | 417269 | 0.53 |
| bn | 832720 | 0.88 | 428843 | 0.54 |
| ta | 924496 | 0.97 | 410633 | 0.52 |
| te | 1332912 | 1.4 | 573364 | 0.73 |
| ur | 1918272 | 2.02 | 855756 | 1.08 |
| vi | 3101408 | 3.27 | 1667306 | 2.11 |
| code | 4330752 | 4.56 | 2707724 | 3.43 |
| hi | 4393696 | 4.63 | 1543441 | 1.96 |
| zh | 4589904 | 4.83 | 3560556 | 4.51 |
| id | 4606288 | 4.85 | 2627392 | 3.33 |
| ar | 4677264 | 4.93 | 2148955 | 2.72 |
| fr | 5546688 | 5.84 | 5055942 | 6.41 |
| pt | 6129584 | 6.46 | 3562772 | 4.52 |
| es | 7571808 | 7.98 | 5151349 | 6.53 |
| en | 37261104 | 39.25 | 31495184 | 39.93 |
| total | 94941936 | 100.0 | 78883588 | 100.0 |
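
The percentage columns can be reproduced from the raw counts, and dividing kilobytes by samples gives a rough view of why a language made of short samples, such as tw, has a smaller byte share than sample share. The sketch below uses three rows copied from the table; the helper name `share` is an illustrative assumption and is not taken from the xP3 scripts.

```python
# (kilobytes, samples) copied from the size table for two languages and the total.
SIZES = {
    "tw": (106288, 265071),
    "en": (37261104, 31495184),
    "total": (94941936, 78883588),
}


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


total_kb, total_samples = SIZES["total"]
for lang in ("tw", "en"):
    kb, samples = SIZES[lang]
    print(
        lang,
        share(kb, total_kb),            # byte share in percent
        share(samples, total_samples),  # sample share in percent
        round(kb / samples, 2),         # average kilobytes per sample
    )
```

For these rows, the average tw sample is roughly a third the size of the average en sample, which is consistent with tw holding 0.34% of samples but only 0.11% of kilobytes.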

Dataset Creation

Source Data

Training datasets

Evaluation datasets (included in xP3all except for NLI datasets & HumanEval)

Additional Information

Licensing Information

The dataset is released under Apache 2.0.

Citation Information

@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}

Contributions

Thanks to the contributors of promptsource for adding many prompts used in this dataset.