Dataset:
bigscience/xP3mt
xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 languages & 16 NLP tasks. It is used for the training of BLOOMZ and mT0, multilingual language models capable of following human instructions in dozens of languages zero-shot.
Name | Explanation | Example models |
---|---|---|
xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help! |
xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl |
xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt |
xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz |
P3 | Repreprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3 |
An example of "train" looks as follows:
{ "inputs": "Oración 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nOración 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nPregunta: ¿La oración 1 parafrasea la oración 2? ¿Si o no?", "targets": "Sí" }
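The data fields are the same among all splits: `inputs` (the natural language input fed to the model) and `targets` (the natural language target that the model has to generate).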
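As a rough illustration of the record layout, the sketch below parses JSON Lines text into records and checks that each one carries the `inputs` and `targets` fields. It assumes one JSON object per line, which the `merged_{lang}.jsonl` file names suggest but which is not confirmed here. The sample record is a shortened, illustrative variant of the example above, and `parse_records` is a hypothetical helper name, not part of any library.

```python
import json

# Shortened, illustrative record in the same shape as the "train" example.
# The "\\n" sequences are JSON escapes that decode to real newlines.
jsonl_text = (
    '{"inputs": "Oración 1: A.\\nOración 2: B.\\n'
    'Pregunta: ¿La oración 1 parafrasea la oración 2? ¿Si o no?", '
    '"targets": "Sí"}\n'
)

REQUIRED_FIELDS = {"inputs", "targets"}


def parse_records(text):
    """Parse JSON Lines text into a list of dicts (hypothetical helper)."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        records.append(record)
    return records


records = parse_records(jsonl_text)
for record in records:
    print(repr(record["inputs"]), "->", record["targets"])
```

Validating the two fields up front makes malformed lines fail early instead of surfacing later during training.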
The table below summarizes sizes per language (computed from the merged_{lang}.jsonl files). Because languages such as tw consist only of single-sentence translation samples from Flores, their byte percentage is significantly lower than their sample percentage. Prompts were machine-translated only for monolingual datasets, so languages that appear only in crosslingual datasets (e.g. Translation) do not have non-English prompts. For languages without non-English prompts, the data is equivalent to xP3.
Language | Kilobytes | % | Samples | % | Non-English prompts |
---|---|---|---|---|---|
tw | 106288 | 0.11 | 265071 | 0.33 | |
bm | 107056 | 0.11 | 265180 | 0.33 | |
ak | 108096 | 0.11 | 265071 | 0.33 | |
ca | 110608 | 0.11 | 271191 | 0.34 | |
eu | 113008 | 0.12 | 281199 | 0.35 | |
fon | 113072 | 0.12 | 265063 | 0.33 | |
st | 114080 | 0.12 | 265063 | 0.33 | |
ki | 115040 | 0.12 | 265180 | 0.33 | |
tum | 116032 | 0.12 | 265063 | 0.33 | |
wo | 122560 | 0.13 | 365063 | 0.46 | |
ln | 126304 | 0.13 | 365060 | 0.46 | |
as | 156256 | 0.16 | 265063 | 0.33 | |
or | 161472 | 0.17 | 265063 | 0.33 | |
kn | 165456 | 0.17 | 265063 | 0.33 | |
ml | 175040 | 0.18 | 265864 | 0.33 | |
rn | 192992 | 0.2 | 318189 | 0.4 | |
nso | 229712 | 0.24 | 915051 | 1.14 | |
tn | 235536 | 0.24 | 915054 | 1.14 | |
lg | 235936 | 0.24 | 915021 | 1.14 | |
rw | 249360 | 0.26 | 915043 | 1.14 | |
ts | 250256 | 0.26 | 915044 | 1.14 | |
sn | 252496 | 0.26 | 865056 | 1.08 | |
xh | 254672 | 0.26 | 915058 | 1.14 | |
zu | 263712 | 0.27 | 915061 | 1.14 | |
ny | 272128 | 0.28 | 915063 | 1.14 | |
ig | 325440 | 0.33 | 950097 | 1.19 | ✅ |
yo | 339664 | 0.35 | 913021 | 1.14 | ✅ |
ne | 398144 | 0.41 | 315754 | 0.39 | ✅ |
pa | 529632 | 0.55 | 339210 | 0.42 | ✅ |
sw | 561392 | 0.58 | 1114439 | 1.39 | ✅ |
gu | 566576 | 0.58 | 347499 | 0.43 | ✅ |
mr | 674000 | 0.69 | 417269 | 0.52 | ✅ |
bn | 854864 | 0.88 | 428725 | 0.54 | ✅ |
ta | 943440 | 0.97 | 410633 | 0.51 | ✅ |
te | 1384016 | 1.42 | 573354 | 0.72 | ✅ |
ur | 1944416 | 2.0 | 855756 | 1.07 | ✅ |
vi | 3113184 | 3.2 | 1667306 | 2.08 | ✅ |
code | 4330752 | 4.46 | 2707724 | 3.38 | |
hi | 4469712 | 4.6 | 1543441 | 1.93 | ✅ |
id | 4538768 | 4.67 | 2582272 | 3.22 | ✅ |
zh | 4604112 | 4.74 | 3571636 | 4.46 | ✅ |
ar | 4703968 | 4.84 | 2148970 | 2.68 | ✅ |
fr | 5558912 | 5.72 | 5055942 | 6.31 | ✅ |
pt | 6130016 | 6.31 | 3562772 | 4.45 | ✅ |
es | 7579424 | 7.8 | 5151349 | 6.43 | ✅ |
en | 39252528 | 40.4 | 32740750 | 40.87 | |
total | 97150128 | 100.0 | 80100816 | 100.0 | ✅ |
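The percentage columns can be reproduced from the raw counts, which also shows why short-sample languages such as tw have a lower byte share than sample share. The sketch below uses only the tw, en, and total rows copied from the table; the function names are illustrative and not part of any released tooling.

```python
# (kilobytes, samples) copied from the tw and en rows of the table above.
SIZES = {
    "tw": (106288, 265071),
    "en": (39252528, 32740750),
}
TOTAL_KILOBYTES = 97150128
TOTAL_SAMPLES = 80100816


def shares(lang):
    """Return (byte %, sample %) for a language, rounded to two decimals."""
    kilobytes, samples = SIZES[lang]
    byte_pct = 100 * kilobytes / TOTAL_KILOBYTES
    sample_pct = 100 * samples / TOTAL_SAMPLES
    return round(byte_pct, 2), round(sample_pct, 2)


def kilobytes_per_sample(lang):
    """Average sample size in kilobytes for a language."""
    kilobytes, samples = SIZES[lang]
    return kilobytes / samples


for lang in SIZES:
    byte_pct, sample_pct = shares(lang)
    print(f"{lang}: {byte_pct}% of bytes, {sample_pct}% of samples, "
          f"{kilobytes_per_sample(lang):.2f} KB per sample")
```

With these inputs, tw averages about 0.4 KB per sample against roughly 1.2 KB for en, which is why its byte share (0.11%) is about a third of its sample share (0.33%).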
The dataset is released under Apache 2.0.
@misc{muennighoff2022crosslingual,
  title={Crosslingual Generalization through Multitask Finetuning},
  author={Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel},
  year={2022},
  eprint={2211.01786},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Thanks to the contributors of promptsource for adding many prompts used in this dataset.