Dataset:
bigscience/xP3
xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets spanning 46 languages & 16 NLP tasks. It was used to train BLOOMZ and mT0, multilingual language models capable of following human instructions in dozens of languages zero-shot.
| Name | Explanation | Example models |
|---|---|---|
| xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help! |
| xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl |
| xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt |
| xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
| xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz |
| P3 | Repreprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3 |
An example of "train" looks as follows:
```json
{
    "inputs": "Sentence 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nSentence 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No?",
    "targets": "Yes"
}
```
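To inspect instances like the one above, the subsets can be streamed with the 🤗 `datasets` library. A minimal sketch, assuming the config names match the language codes in the size table below:

```python
from datasets import load_dataset

# Stream the English subset of xP3; the config name "en" is an
# assumption based on the language codes listed in the table below.
ds = load_dataset("bigscience/xP3", "en", split="train", streaming=True)

# Each record carries the two fields shown above: "inputs" and "targets".
example = next(iter(ds))
print(example["inputs"])
print(example["targets"])
```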
The data fields are the same among all splits:
- `inputs`: the natural language input fed to the model
- `targets`: the natural language target that the model has to generate
The table below summarizes sizes per language (computed from the `merged_{lang}.jsonl` files); a sketch for recomputing these statistics follows the table. Because languages like tw consist only of single-sentence translation samples from Flores, their byte percentage is significantly lower than their sample percentage. Adding a new language is very simple; this script, which adds Russian, can serve as an example.
| Language | Kilobytes | % | Samples | % |
|---|---|---|---|---|
| tw | 106288 | 0.11 | 265071 | 0.34 |
| bm | 107056 | 0.11 | 265180 | 0.34 |
| ak | 108096 | 0.11 | 265071 | 0.34 |
| eu | 108112 | 0.11 | 269973 | 0.34 |
| ca | 110608 | 0.12 | 271191 | 0.34 |
| fon | 113072 | 0.12 | 265063 | 0.34 |
| st | 114080 | 0.12 | 265063 | 0.34 |
| ki | 115040 | 0.12 | 265180 | 0.34 |
| tum | 116032 | 0.12 | 265063 | 0.34 |
| wo | 122560 | 0.13 | 365063 | 0.46 |
| ln | 126304 | 0.13 | 365060 | 0.46 |
| as | 156256 | 0.16 | 265063 | 0.34 |
| or | 161472 | 0.17 | 265063 | 0.34 |
| kn | 165456 | 0.17 | 265063 | 0.34 |
| ml | 175040 | 0.18 | 265864 | 0.34 |
| rn | 192992 | 0.2 | 318189 | 0.4 |
| nso | 229712 | 0.24 | 915051 | 1.16 |
| tn | 235536 | 0.25 | 915054 | 1.16 |
| lg | 235936 | 0.25 | 915021 | 1.16 |
| rw | 249360 | 0.26 | 915043 | 1.16 |
| ts | 250256 | 0.26 | 915044 | 1.16 |
| sn | 252496 | 0.27 | 865056 | 1.1 |
| xh | 254672 | 0.27 | 915058 | 1.16 |
| zu | 263712 | 0.28 | 915061 | 1.16 |
| ny | 272128 | 0.29 | 915063 | 1.16 |
| ig | 325232 | 0.34 | 950097 | 1.2 |
| yo | 352784 | 0.37 | 918416 | 1.16 |
| ne | 393680 | 0.41 | 315754 | 0.4 |
| pa | 523248 | 0.55 | 339210 | 0.43 |
| gu | 560688 | 0.59 | 347499 | 0.44 |
| sw | 560896 | 0.59 | 1114455 | 1.41 |
| mr | 666240 | 0.7 | 417269 | 0.53 |
| bn | 832720 | 0.88 | 428843 | 0.54 |
| ta | 924496 | 0.97 | 410633 | 0.52 |
| te | 1332912 | 1.4 | 573364 | 0.73 |
| ur | 1918272 | 2.02 | 855756 | 1.08 |
| vi | 3101408 | 3.27 | 1667306 | 2.11 |
| code | 4330752 | 4.56 | 2707724 | 3.43 |
| hi | 4393696 | 4.63 | 1543441 | 1.96 |
| zh | 4589904 | 4.83 | 3560556 | 4.51 |
| id | 4606288 | 4.85 | 2627392 | 3.33 |
| ar | 4677264 | 4.93 | 2148955 | 2.72 |
| fr | 5546688 | 5.84 | 5055942 | 6.41 |
| pt | 6129584 | 6.46 | 3562772 | 4.52 |
| es | 7571808 | 7.98 | 5151349 | 6.53 |
| en | 37261104 | 39.25 | 31495184 | 39.93 |
| total | 94941936 | 100.0 | 78883588 | 100.0 |
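The figures above can be recomputed from the per-language files. A hedged sketch, assuming the `merged_{lang}.jsonl` files have been downloaded into a local `data/` directory (the directory layout is an assumption):

```python
import json
import os

DATA_DIR = "data"  # hypothetical local download location of merged_{lang}.jsonl files

sizes = {}
for fname in sorted(os.listdir(DATA_DIR)):
    if not (fname.startswith("merged_") and fname.endswith(".jsonl")):
        continue
    lang = fname[len("merged_"):-len(".jsonl")]
    path = os.path.join(DATA_DIR, fname)
    kilobytes = os.path.getsize(path) // 1024
    # Each line of a .jsonl file is one sample.
    with open(path, encoding="utf-8") as f:
        samples = sum(1 for _ in f)
    sizes[lang] = (kilobytes, samples)

total_kb = sum(kb for kb, _ in sizes.values())
total_samples = sum(n for _, n in sizes.values())
# Print one row per language: kilobytes, byte %, samples, sample %.
for lang, (kb, n) in sorted(sizes.items(), key=lambda kv: kv[1][0]):
    print(f"{lang}\t{kb}\t{100 * kb / total_kb:.2f}\t{n}\t{100 * n / total_samples:.2f}")
```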
The dataset is released under Apache 2.0.
```bibtex
@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}
```
Thanks to the contributors of promptsource for adding many prompts used in this dataset.