Dataset:
mosaicml/instruct-v3
Language:
en
This is an aggregate dataset composed of Dolly HHRLHF (derived from the Databricks Dolly-15k and the Anthropic Helpful and Harmless (HH-RLHF) datasets), combined with Competition Math, Duorc, CoT GSM8k, Qasper, Quality, Summ Screen FD and Spider. The intention was to create a permissively licensed instruction-following dataset with a large number of long-form samples.
Some of the data was transformed during the creation of this dataset. This involved formatting the data into the Alpaca format, filtering for length, filtering for duplicates, adding instructions (for the summarization and QA datasets), and making the instructions more like human input (transforming case, adding typos, etc.).
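The first three transformations (Alpaca formatting, length filtering, duplicate filtering) can be illustrated with a minimal sketch. This is not the pipeline MosaicML used: the template below is the widely used no-input Alpaca prompt, and the `prompt`/`response` field names, the `build` helper, and the character-based `max_chars` limit are assumptions made for illustration only.

```python
# Illustrative sketch only; names and thresholds are hypothetical.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_alpaca(instruction, response):
    """Wrap one (instruction, response) pair in the Alpaca prompt template."""
    return {
        "prompt": ALPACA_TEMPLATE.format(instruction=instruction.strip()),
        "response": response.strip(),
    }


def build(records, max_chars=8000):
    """Format records, then drop over-long examples and exact duplicates.

    max_chars is an assumed stand-in for whatever length limit
    (probably token-based) was actually applied.
    """
    seen = set()
    examples = []
    for instruction, response in records:
        example = to_alpaca(instruction, response)
        # Length filter: skip examples whose combined text is too long.
        if len(example["prompt"]) + len(example["response"]) > max_chars:
            continue
        # Duplicate filter: keep only the first copy of an identical pair.
        key = (example["prompt"], example["response"])
        if key in seen:
            continue
        seen.add(key)
        examples.append(example)
    return examples
```

Calling `build` on a list of `(instruction, response)` tuples returns a list of dictionaries with one entry per surviving example, in input order.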
Data Source | Number of Samples | Proportion (By Count of Samples) | Number of Tokens in Source | Proportion (By Count of Tokens) |
---|---|---|---|---|
competition_math | 4,995 | 8.89% | 1.6 M | 3.66% |
cot_gsm8k | 4,995 | 8.89% | 3.36 M | 7.67% |
dialogsum | 400 | 0.71% | 0.1 M | 0.23% |
dolly_hhrlhf | 34,333 | 61.13% | 5.89 M | 13.43% |
duorc | 4,986 | 8.88% | 7.8 M | 17.80% |
qasper | 1,998 | 3.56% | 8.72 M | 19.90% |
quality | 1,963 | 3.49% | 11.29 M | 25.78% |
scrolls/summ_screen_fd | 1,498 | 2.67% | 4.97 M | 11.33% |
spider | 999 | 1.78% | 0.089 M | 0.20% |
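The "Proportion (By Count of Samples)" column can be reproduced from the sample counts alone. The short sketch below copies the counts from the table and recomputes each source's share; the variable names are illustrative, and the token-based column is left out because the table reports token counts only in rounded millions.

```python
# Sample counts copied from the table above.
sample_counts = {
    "competition_math": 4995,
    "cot_gsm8k": 4995,
    "dialogsum": 400,
    "dolly_hhrlhf": 34333,
    "duorc": 4986,
    "qasper": 1998,
    "quality": 1963,
    "scrolls/summ_screen_fd": 1498,
    "spider": 999,
}

total_samples = sum(sample_counts.values())

# Share of each source as a percentage of all samples, rounded to 2 decimals.
sample_share = {
    source: round(100 * count / total_samples, 2)
    for source, count in sample_counts.items()
}

for source, share in sample_share.items():
    print(f"{source}: {share:.2f}%")
```

The two proportion columns differ sharply because sources with few but long samples (such as quality and qasper) contribute many tokens per sample, while dolly_hhrlhf contributes most of the samples but comparatively few tokens.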
This dataset was developed at MosaicML (https://www.mosaicml.com) and its use is subject to the CC BY-SA 3.0 license.
Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:
Wikipedia (various pages) - https://www.wikipedia.org/ Copyright © Wikipedia editors and contributors.
Dolly - Databricks (https://www.databricks.com) Copyright © Databricks