中文

Dataset Card for "Bactrian-X"

A. Dataset Description

Dataset Summary

The Bactrain-X dataset is a collection of 3.4M instruction-response pairs in 52 languages, that are obtained by translating 67K English instructions ( alpaca-52k + dolly-15k ) into 51 languages using Google Translate API. The translated instructions are then fed to ChatGPT ( gpt-3.5-turbo ) to obtain its natural responses, resulting in 3.4M instruction-response pairs in 52 languages (52 languages x 67k instances = 3.4M instances).

Languages

We follow 52 languages of mBART-50 , with details can be found below .

B. Dataset Structure

Data Fields

The data fields are as follows:

  • instruction : describes the task the model should perform. Each of the 67K instructions is unique.
  • input : optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
  • output : the answer to the instruction as generated by gpt-3.5-turbo .
  • id : the unique index for each data piece in a specific language. The same id across splits corresponds to the same original English data.

Data Instances

An example in en looks as follows.

{
"instruction": "What is the main idea of this text?", 
"input": "The robotics industry has become one of the fastest growing sectors in today’s economy. Robotics technology is being applied in various industries and applications, such as manufacturing, healthcare, and defense.", 
"output": "The main idea of the text is that the robotics industry is rapidly growing and its technology is being used in many different industries and applications.", 
"id": "alpaca-107"
}

An example in ar looks as follows.

{
"instruction": "قم بالوصول إلى إعدادات المتصفح الافتراضية على جهاز الكمبيوتر.", 
"input": "",  
"output": "لا يمكن لي كمساعد افتراضي القيام بهذه المهمة كونها تحتاج إلى الوصول إلى الجهاز المادي. يرجى الرجوع إلى دليل المستخدم للكمبيوتر الخاص بك أو البحث على الإنترنت لمعرفة كيفية الوصول إلى إعدادات المتصفح الافتراضية على جهاز الكمبيوتر الخاص بك.",
"id": "alpaca-41997"
}

An example in id looks as follows.

{
"instruction": "Berapa banyak angka penting yang harus digunakan untuk nilai 0,000187?", 
"input": "",  
"output": "Ada tiga angka penting dalam nilai 0,000187.",
"id": "alpaca-40382"
}

An example in de looks as follows.

{
"instruction": "Komponieren Sie ein Musikstück mit einem Titel, der Trauer bezeichnet.", 
"input": "", 
"output": ""Schmerzendes Herz"\n\nDas Stück beginnt mit einer langsamen, traurigen Melodie, gespielt von einem Solo-Violoncello. Die Harmonien, die von einem Klavier begleitet werden, sind durcheinander und dissonant, um ein Gefühl von Schmerz und Trauer zu vermitteln.\n\nIm Laufe des Stücks treten andere Instrumente ein, wie Streicher, Flöte und Oboe, um die Melodie zu unterstützen und zu intensivieren. Der Rhythmus bleibt langsam und schwerfällig, um die Schwere des Schmerzes darzustellen.\n\nIn der Mitte des Stücks gibt es einen Höhepunkt, wo die Melodie und Harmonien sich verdichten und ein Gefühl von Überwältigung erzeugen. Der Solo-Violoncello spielt eine traurige und schnelle Phrase, die die Verzweiflung und Isolation des Protagonisten ausdrückt.\n\nDas Stück endet mit einer Wiederholung der ersten Melodie, gespielt von einem Solo-Klavier, während die anderen Instrumente auslaufen. Die Harmonien werden jedoch weicher und sanfter, um einen Hauch von Hoffnung zu vermitteln, dass der Schmerz langsam vergeht.",
"id": "alpaca-33619"
}

An example in fr looks as follows.

{
"instruction": "Les Ceanothus ont-ils des racines peu profondes ?", 
"input": "",
"output": "En tant qu'IA, je ne peux pas répondre avec une certitude absolue, mais voici des informations générales sur les racines de Ceanothus : \n\nLes Ceanothus ont tendance à avoir un système racinaire étendu et peu profond. Les racines peuvent s'étaler jusqu'à 4 à 6 fois la taille de la plante et s'étendre à environ 60 à 90 cm de profondeur. Cependant, cela peut varier en fonction du type de sol et de la variété de Ceanothus. Certaines variétés de Ceanothus pourraient avoir des racines plus profondes si les conditions s'y prêtent.",
"id": "dolly-10940"
}

Data in 52 Languages

No Languages Code Train
1 Afrikaans af_ZA 67017
2 Arabic ar_AR 67017
3 Azerbaijani az_AZ 67017
4 Bengali bn_IN 67017
5 Czech cs_CZ 67017
6 German de_DE 67017
7 English en_XX 67017
8 Spanish es_XX 67017
9 Estonian et_EE 67017
10 Persian fa_IR 67017
11 Finnish fi_FI 67017
12 French fr_XX 67017
13 Galician gl_ES 67017
14 Gujarati gu_IN 67017
15 Hebrew he_IL 67017
16 Hindi hi_IN 67017
17 Croatian hr_HR 67017
18 Indonesian id_ID 67017
19 Italian it_IT 67017
20 Japanese ja_XX 67017
21 Georgian ka_GE 67017
22 Kazakh kk_KZ 67017
23 Khmer km_KH 67017
24 Korean ko_KR 67017
25 Lithuanian lt_LT 67017
26 Latvian lv_LV 67017
27 Macedonian mk_MK 67017
28 Malayalam ml_IN 67017
29 Mongolian mn_MN 67017
30 Marathi mr_IN 67017
31 Burmese my_MM 67017
32 Nepali ne_NP 67017
33 Dutch nl_XX 67017
34 Polish pl_PL 67017
35 Pashto ps_AF 67017
36 Portuguese pt_XX 67017
37 Romanian ro_RO 67017
38 Russian ru_RU 67017
39 Sinhala si_LK 67017
40 Slovene sl_SI 67017
41 Swedish sv_SE 67017
42 Swahili sw_KE 67017
43 Tamil ta_IN 67017
44 Telugu te_IN 67017
45 Thai th_TH 67017
46 Tagalog tl_XX 67017
47 Turkish tr_TR 67017
48 Ukrainian uk_UA 67017
49 Urdu ur_PK 67017
50 Vietnamese vi_VN 67017
51 Xhosa xh_ZA 67017
52 Chinese zh_CN 67017

C. Dataset Creation

  • English Instructions: The English instuctions are obtained from alpaca-53k , and dolly-15k .
  • Instruction Translation: The instructions (and inputs) are translated into 51 languages using Google Translation API (conducted on April 2023).
  • Output Generation: We generate output from gpt-3.5-turbo for each language (conducted on April 2023).
  • D. Considerations for Using the Data

    Social Impact of Dataset

    NLP for everyone: this dataset helps to democratize the cutting-edge instruction-following models in 52 languages. This dataset also allows the first experiment on the multilingual LoRA-based LLaMA model.

    Discussion of Biases

    (1) Translation bias; (2) Potential English-culture bias in the translated dataset.

    Other Known Limitations

    The Bactrian-X data is generated by a language model ( gpt-3.5-turbo ) and inevitably contains some errors or biases. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections.

    E. Additional Information

    Dataset Curators

    Haonan Li and Fajri Koto

    Licensing Information

    The dataset is available under the Creative Commons NonCommercial (CC BY-NC 4.0) .

    Citation Information

    @misc{li2023bactrianx,
          title={Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation}, 
          author={Haonan Li and Fajri Koto and Minghao Wu and Alham Fikri Aji and Timothy Baldwin},
          year={2023},
          eprint={2305.15011},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
    

    Contributions

    Thanks to @haonan-li , @fajri91 for adding this dataset.