Datasets:

mosaicml/instruct-v3

Language:

en
Chinese

MosaicML Instruct V3

This is an aggregate dataset composed of Dolly HHRLHF (derived from the Databricks Dolly-15k and Anthropic Helpful and Harmless (HH-RLHF) datasets), combined with Competition Math, Duorc, CoT GSM8k, Qasper, Quality, Summ Screen FD, and Spider. The goal was to create a permissively licensed instruction-following dataset with a large number of long-form samples.

Data Processing

Some of the data was transformed during the creation of this dataset. The steps were: converting the data to the Alpaca format, filtering by length, removing duplicates, adding instructions (for the summarization and QA datasets), and making the instructions read more like human input (changing case, adding typos, etc.).
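
The processing steps above can be illustrated with a minimal sketch. This is not the pipeline MosaicML used: the function names, the `max_chars` threshold, the `typo_rate` parameter, and the `prompt`/`response` field names are all assumptions made for illustration. The template is the commonly used no-input Alpaca prompt.

```python
import random

# Commonly used Alpaca prompt template (no-input variant).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_alpaca(instruction, response):
    """Wrap an instruction/response pair in the Alpaca prompt format."""
    return {
        "prompt": ALPACA_TEMPLATE.format(instruction=instruction),
        "response": response,
    }


def humanize(instruction, rng, typo_rate):
    """Hypothetical noise step: lowercase the text and, with probability
    typo_rate per position, swap two adjacent letters to mimic a typo."""
    chars = list(instruction.lower())
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < typo_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def build(records, max_chars=2000, typo_rate=0.02, seed=0):
    """Deduplicate, filter by length, humanize, and format the records.

    records is an iterable of (instruction, response) tuples.
    max_chars is an assumed length cutoff, not the one used for this dataset.
    """
    rng = random.Random(seed)
    seen = set()
    out = []
    for instruction, response in records:
        # Treat pairs that differ only in case or surrounding whitespace as duplicates.
        key = (instruction.strip().lower(), response.strip())
        if key in seen:
            continue
        seen.add(key)
        # Drop samples whose combined length exceeds the cutoff.
        if len(instruction) + len(response) > max_chars:
            continue
        out.append(to_alpaca(humanize(instruction, rng, typo_rate), response))
    return out
```

Calling `build([("Summarize the text.", "ok")], typo_rate=0.0)` returns a single record whose `prompt` field contains the lowercased instruction wrapped in the template.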

Data Mix

| Data Source | Number of Samples | Proportion (By Count of Samples) | Number of Tokens in Source | Proportion (By Count of Tokens) |
|---|---|---|---|---|
| competition_math | 4,995 | 8.89% | 1.6 M | 3.66% |
| cot_gsm8k | 4,995 | 8.89% | 3.36 M | 7.67% |
| dialogsum | 400 | 0.71% | 0.1 M | 0.23% |
| dolly_hhrlhf | 34,333 | 61.13% | 5.89 M | 13.43% |
| duorc | 4,986 | 8.88% | 7.8 M | 17.80% |
| qasper | 1,998 | 3.56% | 8.72 M | 19.90% |
| quality | 1,963 | 3.49% | 11.29 M | 25.78% |
| scrolls/summ_screen_fd | 1,498 | 2.67% | 4.97 M | 11.33% |
| spider | 999 | 1.78% | 0.089 M | 0.20% |
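
As a quick check on the mix, the proportions can be recomputed from the listed counts. The sketch below copies the figures from the table; the `proportions` helper is a name introduced here for illustration. Because the token counts are rounded to millions, token proportions recomputed this way may differ from the listed ones in the last digit.

```python
# Sample counts and token counts (in millions) as listed in the data mix.
samples = {
    "competition_math": 4995,
    "cot_gsm8k": 4995,
    "dialogsum": 400,
    "dolly_hhrlhf": 34333,
    "duorc": 4986,
    "qasper": 1998,
    "quality": 1963,
    "scrolls/summ_screen_fd": 1498,
    "spider": 999,
}
tokens_m = {
    "competition_math": 1.6,
    "cot_gsm8k": 3.36,
    "dialogsum": 0.1,
    "dolly_hhrlhf": 5.89,
    "duorc": 7.8,
    "qasper": 8.72,
    "quality": 11.29,
    "scrolls/summ_screen_fd": 4.97,
    "spider": 0.089,
}


def proportions(counts):
    """Return each source's share of the total as a percentage, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 2) for name, value in counts.items()}


sample_share = proportions(samples)
token_share = proportions(tokens_m)

for name in samples:
    print(f"{name:25s} {sample_share[name]:6.2f}% of samples  {token_share[name]:6.2f}% of tokens")
```

The comparison makes the skew visible: dolly_hhrlhf supplies most of the samples but a much smaller share of the tokens, while the long-document sources (qasper, quality, duorc) dominate the token count.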

License/Attribution

This dataset was developed at MosaicML (https://www.mosaicml.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/ Copyright © Wikipedia editors and contributors.

Dolly - Databricks (https://www.databricks.com) Copyright © Databricks