Dataset:
QingyiSi/Alpaca-CoT
This repository continuously collects instruction-tuning datasets. We standardize the different datasets into a single format that can be loaded directly by the code of the Alpaca model.
We have also conducted an empirical study of various instruction-tuning datasets based on the Alpaca model, as shown in https://github.com/PhoebusSi/alpaca-CoT .
If you find this dataset collection helpful, please like this dataset and star our GitHub project!
You are warmly welcome to provide us with any instruction-tuning datasets (or their sources) that we have not yet collected. We will format them uniformly, train the Alpaca model on them, and open-source the model checkpoints.
You are welcome to join us and become a contributor to this project! If you want to share a dataset, convert the data to the following format:
example.json
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
The folder should look like this:
Alpaca-CoT
|
|----example
|     |
|     |----example.json
|     |
|     ----example_context.json
...
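As a minimal sketch of preparing a contribution, the Python snippet below maps raw records onto the three-field format and writes them to `example/example.json`. The raw field names (`question`, `context`, `answer`) and the helpers `to_alpaca_format` and `write_dataset` are hypothetical and used only for illustration. Only the output keys and the folder layout come from the description above.

```python
import json
import tempfile
from pathlib import Path


def to_alpaca_format(raw_records):
    """Map hypothetical raw records onto the instruction/input/output schema."""
    formatted = []
    for rec in raw_records:
        formatted.append({
            "instruction": rec["question"],
            "input": rec.get("context", ""),  # may be empty
            "output": rec["answer"],
        })
    return formatted


def write_dataset(records, root, name="example"):
    """Write records to <root>/<name>/<name>.json and return the file path."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.json"
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


# Hypothetical raw data; the second record has no context, so "input" is empty.
raw = [
    {"question": "Translate to French.", "context": "Good morning", "answer": "Bonjour"},
    {"question": "Name a prime number.", "answer": "7"},
]
records = to_alpaca_format(raw)
out_path = write_dataset(records, tempfile.mkdtemp())
```

Writing with `ensure_ascii=False` is a design choice rather than a requirement: it leaves non-ASCII characters unescaped, which makes the Chinese files easier to review in a pull request.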
Create a new pull request in Community and publish your branch when you are ready. We will merge it as soon as we can.
All data in this folder is formatted with the same template, where each sample is as follows:
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
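A file can be checked against this template before use. The following is a minimal sketch, not part of the repository's code: `validate_samples` is a hypothetical helper, and the rule that every field must be a string is an assumption read off the template above.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_samples(samples):
    """Return a list of (index, message) problems; an empty list means valid."""
    if not isinstance(samples, list):
        return [(-1, "top-level JSON value must be a list")]
    problems = []
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append((i, "sample is not an object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in sample:
                problems.append((i, f"missing key: {key}"))
            elif not isinstance(sample[key], str):
                # Assumption: all three fields are strings; "input" may be "".
                problems.append((i, f"{key} is not a string"))
    return problems


# Illustrative data: the first sample is valid, the second is not.
samples = json.loads(
    '[{"instruction": "Add 1 and 2.", "input": "", "output": "3"},'
    ' {"instruction": "Say hi.", "output": 1}]'
)
problems = validate_samples(samples)
```

For a real file, the same function can be applied to the result of `json.load` on the opened file.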
This dataset is published by Stanford Alpaca . It contains 52K English instruction-following samples obtained by Self-Instruction techniques.
alpaca_data_cleaned.json: This dataset is obtained here . It is a revised version of alpaca_data.json with various tokenization artifacts stripped out.
This dataset is published by Instruction-Tuning-with-GPT-4 . It contains 52K English instruction-following samples generated by GPT-4 using Alpaca prompts for fine-tuning LLMs.
alpaca_gpt4_data_zh.json: This dataset is generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT.
This dataset is obtained by formatting the combination of 9 CoT datasets published by FLAN . It contains 9 CoT tasks involving 74771 samples.
CoT_CN_data.json: This dataset is obtained by translating CoT_data.json into Chinese using Google Translate (en2cn).
formatted_cot_data folder: This folder contains the formatted English data for each CoT dataset.
formatted_cot_data folder: This folder contains the formatted Chinese data for each CoT dataset.
This dataset is published by codealpaca . It contains a code generation task with 20022 samples.
This dataset is collected from here . It contains 68912 finance-related instructions in English.
This dataset is collected from here . It contains 1649398 Chinese instructions across 23 NLP tasks.
This dataset is collected from here . It contains 806199 English instructions covering code, story, and dialogue tasks.
gpt4all_without_p3.json: gpt4all without Bigscience/P3; it contains 437605 samples.
This dataset is collected from here . It contains 29013 English instructions generated by GPT-4, covering General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
This dataset is collected from here . It contains 534610 en instructions generated by text-davinci-003. They build on the 175 tasks from the Alpaca model by providing rewrites of the seed tasks in different languages and adding new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.
Guanaco_additional_Dataset.json: An additional, larger dataset covering different languages.
This dataset is collected from here . It contains 37175 en/zh instructions generated by ChatGPT and humans.
HC3_ChatGPT_deduplication.json/HC3_Human_deduplication.json: The HC3 dataset with duplicate instructions removed.
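The criterion used to deduplicate the HC3 files is not stated here. As a minimal sketch, assuming that two samples count as duplicates when their instruction text matches after whitespace and case normalization, a first-occurrence filter could look like this (`deduplicate_by_instruction` is a hypothetical helper, and the sample data is made up):

```python
def deduplicate_by_instruction(samples):
    """Keep the first sample for each normalized instruction string."""
    seen = set()
    unique = []
    for sample in samples:
        # Assumption: collapse whitespace and ignore case when comparing.
        key = " ".join(sample["instruction"].split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


# Illustrative data: the second sample repeats the first instruction.
hc3_like = [
    {"instruction": "What is AI?", "input": "", "output": "Answer A"},
    {"instruction": "what  is ai?", "input": "", "output": "Answer B"},
    {"instruction": "Define ML.", "input": "", "output": "Answer C"},
]
unique_samples = deduplicate_by_instruction(hc3_like)
```

Keeping the first occurrence preserves the original ordering of the file; a different policy, such as keeping the longest output, would need a different data structure.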
The two datasets are obtained here . They contain 52191 English and 51504 Chinese instructions, collected from Twitter, where users tend to share interesting prompts, mostly of the generation, open QA, and brainstorming types. (Colossal AI used these datasets to train the ColossalChat model.)
The two datasets are obtained here . They contain 888969 English instructions, which are produced by augmentation performed using the advanced NLP tools provided by AllenAI.
This dataset is obtained here . It contains 5040134 instructions, which are collected from diverse NLP tasks.
This dataset is obtained here . It contains 165681 English instructions, which are produced by GPT-3 question rewrites and human feedback.
This dataset is obtained here . It contains 78883588 instructions, which are collected from prompts and datasets across 46 languages and 16 NLP tasks.
All datasets in the Chinese instruction collection.
This dataset is the combination of English alpaca_data.json and Chinese belle_data_cn.json .
alcapa_plus_cot_data.json: This dataset is the combination of English alpaca_data.json and CoT CoT_data.json .
alcapa_plus_belle_cot_data.json: This dataset is the combination of English alpaca_data.json , Chinese belle_data_cn.json and CoT CoT_data.json .
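Because every file shares the same template, a combined file can in principle be produced by concatenating the sample lists. The sketch below is one way to do that and is not necessarily how the combined files above were built. `combine_datasets` is a hypothetical helper, the optional shuffle is an assumption, and the in-memory lists stand in for the loaded JSON files.

```python
import random


def combine_datasets(*datasets, seed=None):
    """Concatenate sample lists; shuffle reproducibly when a seed is given."""
    combined = [sample for dataset in datasets for sample in dataset]
    if seed is not None:
        # Assumption: shuffling is optional and not described in the source.
        random.Random(seed).shuffle(combined)
    return combined


# Stand-ins for the contents of alpaca_data.json and CoT_data.json.
alpaca_samples = [
    {"instruction": "Give three tips.", "input": "", "output": "1. ... 2. ... 3. ..."},
    {"instruction": "Summarize.", "input": "Some text.", "output": "A summary."},
]
cot_samples = [
    {"instruction": "Is 7 prime? Think step by step.", "input": "", "output": "Yes."},
]
combined = combine_datasets(alpaca_samples, cot_samples)
shuffled = combine_datasets(alpaca_samples, cot_samples, seed=0)
```

Without a seed the original order is kept, so the result is a plain concatenation; passing a seed mixes the sources while keeping the output reproducible.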
Please cite the repo if you use the data collection, code, and experimental findings in this repo.
@misc{alpaca-cot,
  author = {Qingyi Si, Zheng Lin },
  school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
  title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
}
Please also cite the original Stanford Alpaca, BELLE, and FLAN papers.