数据集:

MBZUAI/Bactrian-X

任务:

文本生成

语言:

预印本库:

arxiv:2008.00401 arxiv:2305.15011

其他:

instruction-finetuning multilingual

许可:

cc-by-nc-4.0

数据集介绍文件清单

中文

Dataset Card for "Bactrian-X"

A. Dataset Description

Homepage: https://github.com/mbzuai-nlp/Bactrian-X
Repository: https://huggingface.co/datasets/MBZUAI/Bactrian-X
Paper: to-be-soon released

Dataset Summary

The Bactrain-X dataset is a collection of 3.4M instruction-response pairs in 52 languages, that are obtained by translating 67K English instructions ( alpaca-52k + dolly-15k ) into 51 languages using Google Translate API. The translated instructions are then fed to ChatGPT ( gpt-3.5-turbo ) to obtain its natural responses, resulting in 3.4M instruction-response pairs in 52 languages (52 languages x 67k instances = 3.4M instances).

Languages

We follow 52 languages of mBART-50 , with details can be found below .

B. Dataset Structure

Data Fields

The data fields are as follows:

instruction : describes the task the model should perform. Each of the 67K instructions is unique.
input : optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
output : the answer to the instruction as generated by gpt-3.5-turbo .
id : the unique index for each data piece in a specific language. The same id across splits corresponds to the same original English data.

Data Instances

An example in en looks as follows.

{
"instruction": "What is the main idea of this text?", 
"input": "The robotics industry has become one of the fastest growing sectors in today’s economy. Robotics technology is being applied in various industries and applications, such as manufacturing, healthcare, and defense.", 
"output": "The main idea of the text is that the robotics industry is rapidly growing and its technology is being used in many different industries and applications.", 
"id": "alpaca-107"
}

An example in ar looks as follows.

{
"instruction": "قم بالوصول إلى إعدادات المتصفح الافتراضية على جهاز الكمبيوتر.", 
"input": "",  
"output": "لا يمكن لي كمساعد افتراضي القيام بهذه المهمة كونها تحتاج إلى الوصول إلى الجهاز المادي. يرجى الرجوع إلى دليل المستخدم للكمبيوتر الخاص بك أو البحث على الإنترنت لمعرفة كيفية الوصول إلى إعدادات المتصفح الافتراضية على جهاز الكمبيوتر الخاص بك.",
"id": "alpaca-41997"
}

An example in id looks as follows.

{
"instruction": "Berapa banyak angka penting yang harus digunakan untuk nilai 0,000187?", 
"input": "",  
"output": "Ada tiga angka penting dalam nilai 0,000187.",
"id": "alpaca-40382"
}

An example in de looks as follows.

{
"instruction": "Komponieren Sie ein Musikstück mit einem Titel, der Trauer bezeichnet.", 
"input": "", 
"output": ""Schmerzendes Herz"\n\nDas Stück beginnt mit einer langsamen, traurigen Melodie, gespielt von einem Solo-Violoncello. Die Harmonien, die von einem Klavier begleitet werden, sind durcheinander und dissonant, um ein Gefühl von Schmerz und Trauer zu vermitteln.\n\nIm Laufe des Stücks treten andere Instrumente ein, wie Streicher, Flöte und Oboe, um die Melodie zu unterstützen und zu intensivieren. Der Rhythmus bleibt langsam und schwerfällig, um die Schwere des Schmerzes darzustellen.\n\nIn der Mitte des Stücks gibt es einen Höhepunkt, wo die Melodie und Harmonien sich verdichten und ein Gefühl von Überwältigung erzeugen. Der Solo-Violoncello spielt eine traurige und schnelle Phrase, die die Verzweiflung und Isolation des Protagonisten ausdrückt.\n\nDas Stück endet mit einer Wiederholung der ersten Melodie, gespielt von einem Solo-Klavier, während die anderen Instrumente auslaufen. Die Harmonien werden jedoch weicher und sanfter, um einen Hauch von Hoffnung zu vermitteln, dass der Schmerz langsam vergeht.",
"id": "alpaca-33619"
}

An example in fr looks as follows.

{
"instruction": "Les Ceanothus ont-ils des racines peu profondes ?", 
"input": "",
"output": "En tant qu'IA, je ne peux pas répondre avec une certitude absolue, mais voici des informations générales sur les racines de Ceanothus : \n\nLes Ceanothus ont tendance à avoir un système racinaire étendu et peu profond. Les racines peuvent s'étaler jusqu'à 4 à 6 fois la taille de la plante et s'étendre à environ 60 à 90 cm de profondeur. Cependant, cela peut varier en fonction du type de sol et de la variété de Ceanothus. Certaines variétés de Ceanothus pourraient avoir des racines plus profondes si les conditions s'y prêtent.",
"id": "dolly-10940"
}

Data in 52 Languages

No	Languages	Code	Train
1	Afrikaans	af_ZA	67017
2	Arabic	ar_AR	67017
3	Azerbaijani	az_AZ	67017
4	Bengali	bn_IN	67017
5	Czech	cs_CZ	67017
6	German	de_DE	67017
7	English	en_XX	67017
8	Spanish	es_XX	67017
9	Estonian	et_EE	67017
10	Persian	fa_IR	67017
11	Finnish	fi_FI	67017
12	French	fr_XX	67017
13	Galician	gl_ES	67017
14	Gujarati	gu_IN	67017
15	Hebrew	he_IL	67017
16	Hindi	hi_IN	67017
17	Croatian	hr_HR	67017
18	Indonesian	id_ID	67017
19	Italian	it_IT	67017
20	Japanese	ja_XX	67017
21	Georgian	ka_GE	67017
22	Kazakh	kk_KZ	67017
23	Khmer	km_KH	67017
24	Korean	ko_KR	67017
25	Lithuanian	lt_LT	67017
26	Latvian	lv_LV	67017
27	Macedonian	mk_MK	67017
28	Malayalam	ml_IN	67017
29	Mongolian	mn_MN	67017
30	Marathi	mr_IN	67017
31	Burmese	my_MM	67017
32	Nepali	ne_NP	67017
33	Dutch	nl_XX	67017
34	Polish	pl_PL	67017
35	Pashto	ps_AF	67017
36	Portuguese	pt_XX	67017
37	Romanian	ro_RO	67017
38	Russian	ru_RU	67017
39	Sinhala	si_LK	67017
40	Slovene	sl_SI	67017
41	Swedish	sv_SE	67017
42	Swahili	sw_KE	67017
43	Tamil	ta_IN	67017
44	Telugu	te_IN	67017
45	Thai	th_TH	67017
46	Tagalog	tl_XX	67017
47	Turkish	tr_TR	67017
48	Ukrainian	uk_UA	67017
49	Urdu	ur_PK	67017
50	Vietnamese	vi_VN	67017
51	Xhosa	xh_ZA	67017
52	Chinese	zh_CN	67017

C. Dataset Creation

English Instructions: The English instuctions are obtained from alpaca-53k , and dolly-15k .

Instruction Translation: The instructions (and inputs) are translated into 51 languages using Google Translation API (conducted on April 2023).

Output Generation: We generate output from gpt-3.5-turbo for each language (conducted on April 2023).

D. Considerations for Using the Data

Social Impact of Dataset

NLP for everyone: this dataset helps to democratize the cutting-edge instruction-following models in 52 languages. This dataset also allows the first experiment on the multilingual LoRA-based LLaMA model.

Discussion of Biases

(1) Translation bias; (2) Potential English-culture bias in the translated dataset.

Other Known Limitations

The Bactrian-X data is generated by a language model ( gpt-3.5-turbo ) and inevitably contains some errors or biases. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections.

E. Additional Information

Dataset Curators

Haonan Li and Fajri Koto

Licensing Information

The dataset is available under the Creative Commons NonCommercial (CC BY-NC 4.0) .

Citation Information

@misc{li2023bactrianx,
      title={Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation}, 
      author={Haonan Li and Fajri Koto and Minghao Wu and Alham Fikri Aji and Timothy Baldwin},
      year={2023},
      eprint={2305.15011},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @haonan-li , @fajri91 for adding this dataset.

作者:

MBZUAI

数据集大小:

1.08 GB