Dataset:
bigscience/xP3
xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 languages & 16 NLP tasks. It is used to train BLOOMZ and mT0, multilingual language models capable of following human instructions in dozens of languages zero-shot.
Name | Explanation | Example models |
---|---|---|
xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help! |
xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl |
xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt |
xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz |
P3 | Re-preprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3 |
An example of "train" looks as follows:
{ "inputs": "Sentence 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nSentence 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No?", "targets": "Yes" }
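The data fields are the same among all splits: "inputs" holds the natural-language prompt given to the model, and "targets" holds the text the model is expected to generate.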
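As a minimal sketch of how one such record can be read, the snippet below parses a single JSON line with Python's standard json module. It assumes the data is stored one JSON object per line, as the .jsonl file names mentioned in this card suggest. The variable names are illustrative only and not part of the dataset.

```python
import json

# One line of a split file, assuming one JSON object per line.
# The record is the "train" example from this card.
line = r'{"inputs": "Sentence 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nSentence 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No?", "targets": "Yes"}'

record = json.loads(line)

# "inputs" is the prompt text, "targets" is the expected completion.
prompt = record["inputs"]
answer = record["targets"]

print(prompt)
print("->", answer)
```

The prompt here is written in English while the sentences being compared are Spanish, which matches the "English prompts" description of xP3 in the table above.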
The table below summarizes sizes per language (computed from the merged_{lang}.jsonl files). Because languages like tw consist only of single-sentence translation samples from Flores, their share of bytes is much lower than their share of samples. Adding a new language is simple; you can take the script that adds Russian as an example.
Language | Kilobytes | % of kilobytes | Samples | % of samples |
---|---|---|---|---|
tw | 106288 | 0.11 | 265071 | 0.34 |
bm | 107056 | 0.11 | 265180 | 0.34 |
ak | 108096 | 0.11 | 265071 | 0.34 |
eu | 108112 | 0.11 | 269973 | 0.34 |
ca | 110608 | 0.12 | 271191 | 0.34 |
fon | 113072 | 0.12 | 265063 | 0.34 |
st | 114080 | 0.12 | 265063 | 0.34 |
ki | 115040 | 0.12 | 265180 | 0.34 |
tum | 116032 | 0.12 | 265063 | 0.34 |
wo | 122560 | 0.13 | 365063 | 0.46 |
ln | 126304 | 0.13 | 365060 | 0.46 |
as | 156256 | 0.16 | 265063 | 0.34 |
or | 161472 | 0.17 | 265063 | 0.34 |
kn | 165456 | 0.17 | 265063 | 0.34 |
ml | 175040 | 0.18 | 265864 | 0.34 |
rn | 192992 | 0.2 | 318189 | 0.4 |
nso | 229712 | 0.24 | 915051 | 1.16 |
tn | 235536 | 0.25 | 915054 | 1.16 |
lg | 235936 | 0.25 | 915021 | 1.16 |
rw | 249360 | 0.26 | 915043 | 1.16 |
ts | 250256 | 0.26 | 915044 | 1.16 |
sn | 252496 | 0.27 | 865056 | 1.1 |
xh | 254672 | 0.27 | 915058 | 1.16 |
zu | 263712 | 0.28 | 915061 | 1.16 |
ny | 272128 | 0.29 | 915063 | 1.16 |
ig | 325232 | 0.34 | 950097 | 1.2 |
yo | 352784 | 0.37 | 918416 | 1.16 |
ne | 393680 | 0.41 | 315754 | 0.4 |
pa | 523248 | 0.55 | 339210 | 0.43 |
gu | 560688 | 0.59 | 347499 | 0.44 |
sw | 560896 | 0.59 | 1114455 | 1.41 |
mr | 666240 | 0.7 | 417269 | 0.53 |
bn | 832720 | 0.88 | 428843 | 0.54 |
ta | 924496 | 0.97 | 410633 | 0.52 |
te | 1332912 | 1.4 | 573364 | 0.73 |
ur | 1918272 | 2.02 | 855756 | 1.08 |
vi | 3101408 | 3.27 | 1667306 | 2.11 |
code | 4330752 | 4.56 | 2707724 | 3.43 |
hi | 4393696 | 4.63 | 1543441 | 1.96 |
zh | 4589904 | 4.83 | 3560556 | 4.51 |
id | 4606288 | 4.85 | 2627392 | 3.33 |
ar | 4677264 | 4.93 | 2148955 | 2.72 |
fr | 5546688 | 5.84 | 5055942 | 6.41 |
pt | 6129584 | 6.46 | 3562772 | 4.52 |
es | 7571808 | 7.98 | 5151349 | 6.53 |
en | 37261104 | 39.25 | 31495184 | 39.93 |
total | 94941936 | 100.0 | 78883588 | 100.0 |
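The sketch below shows one way such a size table could be derived from a directory of merged_{lang}.jsonl files. It is not the script used to build the table: the helper names summarize and shares are hypothetical, and it assumes that a kilobyte is 1024 bytes and that each non-empty line is one sample. The demo writes two tiny made-up files to a temporary directory so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def summarize(data_dir):
    """Return {lang: (kilobytes, samples)} for each merged_{lang}.jsonl file."""
    stats = {}
    for path in sorted(Path(data_dir).glob("merged_*.jsonl")):
        lang = path.stem[len("merged_"):]
        with path.open("rb") as f:
            # Assumption: one sample per non-empty line.
            samples = sum(1 for row in f if row.strip())
        # Assumption: 1 kilobyte = 1024 bytes.
        stats[lang] = (path.stat().st_size / 1024, samples)
    return stats


def shares(stats):
    """Return {lang: (byte %, sample %)} relative to the totals."""
    total_kb = sum(kb for kb, _ in stats.values())
    total_n = sum(n for _, n in stats.values())
    return {
        lang: (100 * kb / total_kb, 100 * n / total_n)
        for lang, (kb, n) in stats.items()
    }


# Demo with made-up data: "tw" has short samples, "en" has long ones.
with tempfile.TemporaryDirectory() as tmp:
    short = '{"inputs": "Hi", "targets": "Yo"}\n'
    long_ = '{"inputs": "' + "x" * 500 + '", "targets": "Yes"}\n'
    Path(tmp, "merged_tw.jsonl").write_text(short * 3, encoding="utf-8")
    Path(tmp, "merged_en.jsonl").write_text(long_ * 3, encoding="utf-8")
    demo = shares(summarize(tmp))

for lang, (byte_pct, sample_pct) in demo.items():
    print(f"{lang}: {byte_pct:.2f}% of bytes, {sample_pct:.2f}% of samples")
```

In the demo both languages have the same number of samples, yet tw accounts for a far smaller share of bytes. The table shows the same effect: for tw, 106288 / 94941936 is about 0.11% of kilobytes, while 265071 / 78883588 is about 0.34% of samples.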
The dataset is released under Apache 2.0.
@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}
Thanks to the contributors of promptsource for adding many prompts used in this dataset.