Dataset:
MBZUAI/LaMini-instruction
Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji
We distill the knowledge from large language models by performing sentence/offline distillation (Kim and Rush, 2016). We generate a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing prompt resources, including self-instruct (Wang et al., 2022), P3 (Sanh et al., 2022), FLAN (Longpre et al., 2023), and Alpaca (Taori et al., 2023). For more information about the process of generating our instruction dataset, please refer to our paper.
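The snippet below is only a hedged illustration of what offline (sentence-level) distillation of this kind looks like, not the authors' actual generation pipeline: it assumes the OpenAI Python client (v1+), a couple of hypothetical seed instructions, and an arbitrary output file name.

```python
# Minimal sketch of offline distillation: collect gpt-3.5-turbo responses for
# existing instructions and store instruction/response pairs as JSON lines.
# The seed instructions and output path are illustrative assumptions.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_instructions = [
    "What is the capital of Australia?",
    'Create a poem that has the word "ocean" in every other line.',
]

pairs = []
for instruction in seed_instructions:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": instruction}],
    )
    pairs.append({
        "instruction": instruction,
        "response": completion.choices[0].message.content,
        "instruction_source": "example",  # placeholder source label
    })

with open("distilled_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```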
We use LaMini-Instruction to fine-tune various small-sized language models, resulting in the LaMini-LM model series.
Each sample contains the following fields:

- `instruction`: the instruction or question to be answered.
- `response`: the response to the instruction, generated by gpt-3.5-turbo.
- `instruction_source`: the prompt resource the instruction was derived from (e.g., `generated_flan`, `self_instruct_without_topic`).
Two examples in our LaMini-instruction data:
{ "instruction": "What is the capital of Australia?\n\nIs Canberra the only city chosen to serve as a capital due to a compromise between Sydney and Melbourne?", "response": "The capital of Australia is Canberra. Yes, Canberra was chosen as the capital of Australia in 1908 due to a compromise between Sydney and Melbourne, as neither city wanted the other to become the capital.", "instruction_source": "generated_flan" }
{ "instruction": "Create a poem that has the word \"ocean\" in every other line.", "response": "The ocean's depth is a mystery,\nIts waves crashing with energy.\nSalt air fills the coastal scene,\nThe ocean's blue hues pure and clean.\n\nSeagulls cry out a melancholy song,\nThe ocean's rhythm steady and strong.\nSunsets paint the water gold,\nThe ocean's secrets forever untold.", "instruction_source": "self_instruct_without_topic" }
As the responses are distilled from ChatGPT, this data contains errors and biases produced by ChatGPT. Models trained on this dataset will inherit these errors and biases. We encourage users to use this data with caution and to propose new methods to filter or correct these imperfections.
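As a purely illustrative starting point (not a validated or recommended filter), simple heuristics can be applied with `Dataset.filter`; the length threshold below is an arbitrary assumption.

```python
from datasets import load_dataset

dataset = load_dataset("MBZUAI/LaMini-instruction", split="train")

# Illustrative heuristic only: drop pairs whose response is shorter than
# 20 characters. The threshold is arbitrary, not a validated quality filter.
filtered = dataset.filter(lambda example: len(example["response"]) >= 20)
print(f"kept {len(filtered)} of {len(dataset)} examples")
```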
The dataset is available under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
Please cite us if you use our data or models.
@article{lamini-lm,
  author     = {Minghao Wu and Abdul Waheed and Chiyu Zhang and Muhammad Abdul-Mageed and Alham Fikri Aji},
  title      = {LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions},
  journal    = {CoRR},
  volume     = {abs/2304.14402},
  year       = {2023},
  url        = {https://arxiv.org/abs/2304.14402},
  eprinttype = {arXiv},
  eprint     = {2304.14402}
}