数据集:
HuggingFaceH4/databricks_dolly_15k
预印本库:
arxiv:2203.02155许可:
cc-by-3.0databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License .
Supported Tasks:
Languages: English Version: 1.0
Owner: Databricks, Inc.
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.
Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.
For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42] ) which we recommend users remove for downstream applications.
While immediately valuable for instruction fine tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation in the methods outlined in the Self-Instruct paper. For example, contributor--generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.
Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short responses, with the resulting text associated to the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.
As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.
To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance to an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.
The annotation guidelines for each of the categories are as follows:
This dataset contains public information (e.g., some information from Wikipedia). To our knowledge, there are no private person’s personal identifiers or sensitive information.
American English
Copyright (2023) Databricks, Inc. This dataset was developed at Databricks ( https://www.databricks.com ) and its use is subject to the CC BY-SA 3.0 license.
Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:
Wikipedia (various pages) - https://www.wikipedia.org/ Copyright © Wikipedia editors and contributors.