Dataset Card for "dolly_hhrlhf"

This dataset is a combination of Databricks' dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF. It also includes a test split, which was missing from the original Dolly set. The test set is composed of 200 randomly selected samples from Dolly plus 4,929 samples from the HH-RLHF test set that made it through the filtering process. The train set contains 59,310 samples: 15,014 - 200 = 14,814 from Dolly, and the remaining 44,496 from HH-RLHF.
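As a quick sanity check on the split sizes, the arithmetic can be sketched as below. The constants are the counts quoted in this card; the variable names are illustrative, and the test-set total is derived here rather than stated in the card.

```python
# Counts quoted in this card (variable names are illustrative).
DOLLY_TOTAL = 15_014   # Dolly samples before the split
DOLLY_TEST = 200       # Dolly samples held out for the test split
HH_TEST = 4_929        # HH-RLHF test samples that survived filtering
HH_TRAIN = 44_496      # HH-RLHF train samples that survived filtering

dolly_train = DOLLY_TOTAL - DOLLY_TEST
train_total = dolly_train + HH_TRAIN
test_total = DOLLY_TEST + HH_TEST  # derived, not stated in the card

print(f"train: {train_total:,} ({dolly_train:,} Dolly + {HH_TRAIN:,} HH-RLHF)")
print(f"test:  {test_total:,} ({DOLLY_TEST:,} Dolly + {HH_TEST:,} HH-RLHF)")
```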

It is slightly larger than Alpaca and, in our experience, of slightly higher quality. It is also usable for commercial purposes, so long as you follow the terms of the license.

Filtering process

As mentioned, the HH-RLHF data in this dataset is filtered. Specifically, we take the first turn of the conversation, then remove any samples where the assistant:

uses the word "human", "thank", or "sorry"
asks a question
uses a first person pronoun

This leaves samples which look like instruction-following, as opposed to conversation.
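The three rules above could be sketched roughly as follows. This is not the original filtering code: it assumes the first turn has already been extracted into (prompt, response) pairs, that the listed words are matched case-insensitively as substrings, that any question mark counts as asking a question, and that a small hand-written pronoun list is good enough. The function names `keep_response` and `filter_first_turns` are hypothetical.

```python
import re

# Assumption: terms are matched case-insensitively as substrings,
# so "thanks" and "humans" are caught as well.
BANNED_TERMS = ("human", "thank", "sorry")

# Assumption: illustrative pronoun list; the card does not enumerate one.
# It is deliberately crude (for example, it also matches "US").
FIRST_PERSON = re.compile(
    r"\b(i|me|my|mine|myself|we|us|our|ours|ourselves)\b",
    re.IGNORECASE,
)


def keep_response(response: str) -> bool:
    """Return True if an assistant response passes all three filters."""
    lowered = response.lower()
    if any(term in lowered for term in BANNED_TERMS):
        return False
    # Assumption: any question mark means the assistant asked a question.
    if "?" in response:
        return False
    if FIRST_PERSON.search(response):
        return False
    return True


def filter_first_turns(samples):
    """Keep (prompt, response) pairs whose response passes the filters."""
    return [(prompt, resp) for prompt, resp in samples if keep_response(resp)]


samples = [
    ("How do I boil an egg?", "Place the egg in boiling water for 8 minutes."),
    ("Tell me a joke.", "Sure, what kind of jokes do you like?"),
    ("What is the capital of France?", "I think it is Paris."),
]
kept = filter_first_turns(samples)
print(len(kept))
```

Only the first sample survives: the second response asks a question and the third uses a first-person pronoun.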

License/Attribution

This dataset was developed at MosaicML (https://www.mosaicml.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Author:

mosaicml

Dataset size:

23.73 MB