MosaicML Instruct V3

This is an aggregate dataset composed of Dolly HHRLHF (derived from the Databricks Dolly-15k and the Anthropic Helpful and Harmless (HH-RLHF) datasets), combined with Competition Math, Duorc, CoT GSM8k, Qasper, Quality, Summ Screen FD, and Spider. The intention was to create a permissively licensed instruction-following dataset with a large number of long-form samples.

Data Processing

Some data was transformed during the creation of this dataset. The transformations were: formatting the data into the Alpaca format, filtering by length, removing duplicates, adding instructions (for the summarization and QA datasets), and making the instructions more like human input (changing case, adding typos, etc.).
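The steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline MosaicML actually used: the template text follows the commonly used Alpaca prompt (an assumption about the exact wording), and the names `to_alpaca`, `process`, and the `max_chars` threshold are hypothetical. Length is measured in characters here only for simplicity; the original filter may well have counted tokens.

```python
# Assumed Alpaca-style prompt template (wording not confirmed by this document).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_alpaca(instruction: str, response: str) -> dict:
    """Wrap a raw (instruction, response) pair in the Alpaca-style template."""
    return {
        "prompt": ALPACA_TEMPLATE.format(instruction=instruction.strip()),
        "response": response.strip(),
    }


def process(records, max_chars: int = 8000) -> list:
    """Format, length-filter, and deduplicate (instruction, response) pairs.

    max_chars is a hypothetical threshold on prompt + response length.
    """
    seen = set()
    kept = []
    for instruction, response in records:
        sample = to_alpaca(instruction, response)
        # Length filter: drop samples whose combined text is too long.
        if len(sample["prompt"]) + len(sample["response"]) > max_chars:
            continue
        # Duplicate filter: exact match on the formatted pair.
        key = (sample["prompt"], sample["response"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


raw = [
    ("Summarize: cats", "Cats."),
    ("Summarize: cats", "Cats."),  # exact duplicate, dropped
    ("x" * 100, "y"),              # too long for max_chars=200, dropped
]
processed = process(raw, max_chars=200)
```

Exact-match deduplication is the simplest choice; near-duplicate detection would need a different key, such as a normalized or hashed form of the text.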

Data Mix

Data Source	Number of Samples	Proportion (By Count of Samples)	Number of Tokens in Source	Proportion (By Count of Tokens)
competition_math	4,995	8.89%	1.6 M	3.66%
cot_gsm8k	4,995	8.89%	3.36 M	7.67%
dialogsum	400	0.71%	0.1 M	0.23%
dolly_hhrlhf	34,333	61.13%	5.89 M	13.43%
duorc	4,986	8.88%	7.8 M	17.80%
qasper	1,998	3.56%	8.72 M	19.90%
quality	1,963	3.49%	11.29 M	25.78%
scrolls/summ_screen_fd	1,498	2.67%	4.97 M	11.33%
spider	999	1.78%	0.089 M	0.20%
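As a quick check on how the table's columns relate, the sample proportions can be recomputed from the counts, and the token totals give a rough tokens-per-sample figure for each source. The sketch below only uses numbers from the table; the variable names are illustrative, and because the token totals are rounded to millions, the averages are approximate.

```python
# Sample counts and token totals (in millions) copied from the table above.
sample_counts = {
    "competition_math": 4995,
    "cot_gsm8k": 4995,
    "dialogsum": 400,
    "dolly_hhrlhf": 34333,
    "duorc": 4986,
    "qasper": 1998,
    "quality": 1963,
    "scrolls/summ_screen_fd": 1498,
    "spider": 999,
}
tokens_millions = {
    "competition_math": 1.6,
    "cot_gsm8k": 3.36,
    "dialogsum": 0.1,
    "dolly_hhrlhf": 5.89,
    "duorc": 7.8,
    "qasper": 8.72,
    "quality": 11.29,
    "scrolls/summ_screen_fd": 4.97,
    "spider": 0.089,
}

total_samples = sum(sample_counts.values())

# Proportion by count of samples, as a percentage rounded to two decimals.
sample_share = {
    name: round(100 * count / total_samples, 2)
    for name, count in sample_counts.items()
}

# Approximate average tokens per sample for each source.
avg_tokens = {
    name: tokens_millions[name] * 1_000_000 / sample_counts[name]
    for name in sample_counts
}

for name in sample_counts:
    print(f"{name}: {sample_share[name]}% of samples, ~{avg_tokens[name]:.0f} tokens/sample")
```

The averages show why the two proportion columns differ so much: dolly_hhrlhf supplies most of the samples but they are short, while sources such as quality and qasper contribute few samples with several thousand tokens each.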

License/Attribution

This dataset was developed at MosaicML (https://www.mosaicml.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Author:

mosaicml

Dataset size:

131.11 MB