humarin/chatgpt-paraphrases | ATYUN.COM Official Site - All-in-One AI Tutorials and News Platform

Dataset:

humarin/chatgpt-paraphrases

Task:

Text-to-text generation

Language:

Size:

100K<n<1M

License:

openrail


This is a dataset of paraphrases created by ChatGPT.

A model based on this dataset is available: model

We used this prompt to generate the paraphrases:

Generate 5 similar paraphrases for this question, show it like a numbered list without commentaries: {text}
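As a minimal sketch of how this prompt might be filled in and how a numbered-list reply could be turned back into a list of paraphrases: the helper names `build_prompt` and `parse_numbered_list` are hypothetical, and the parsing rule (a number followed by `.` or `)`) is an assumption, since the card does not describe how the authors processed the model output.

```python
import re

# Prompt text copied from the dataset card; {text} is the placeholder.
PROMPT_TEMPLATE = (
    "Generate 5 similar paraphrases for this question, "
    "show it like a numbered list without commentaries: {text}"
)

# Assumed reply format: lines such as "1. ..." or "1) ...".
_ITEM = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")


def build_prompt(text: str) -> str:
    """Insert the source text into the prompt template."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_numbered_list(reply: str) -> list[str]:
    """Collect the items of a numbered list, skipping any other lines."""
    items = []
    for line in reply.splitlines():
        match = _ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items
```

Lines that do not look like numbered items are ignored, so a reply with stray commentary would still yield only the list entries.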

This dataset is based on the Quora paraphrase questions dataset, texts from SQuAD 2.0, and the CNN news dataset.

We generated 5 paraphrases for each sample; in total the dataset has about 420k rows. Each row contains 6 texts (the original plus its 5 paraphrases), so you can build 6 x 5 = 30 ordered training pairs from it. Across the whole dataset this gives 6 x 5 x 420,000 = 12.6 million bidirectional training pairs, or 6 x 5 x 420,000 / 2 = 6.3 million unique pairs.
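The pair counts can be checked with a short sketch. The function names `ordered_pairs` and `unique_pairs` are hypothetical, and the sketch assumes a row is given as its original text plus a list of 5 paraphrases.

```python
from itertools import combinations, permutations


def ordered_pairs(text: str, paraphrases: list[str]) -> list[tuple[str, str]]:
    """All (source, target) pairs in both directions: 6 x 5 = 30 per row."""
    group = [text, *paraphrases]
    return list(permutations(group, 2))


def unique_pairs(text: str, paraphrases: list[str]) -> list[tuple[str, str]]:
    """Each pair counted once regardless of direction: 30 / 2 = 15 per row."""
    group = [text, *paraphrases]
    return list(combinations(group, 2))


ROWS = 420_000  # approximate row count stated in the card
total_ordered = 6 * 5 * ROWS   # 12.6 million
total_unique = total_ordered // 2  # 6.3 million
```

Both helpers pair texts by position, so a row in which two paraphrases happen to be identical would still yield 30 and 15 pairs, some of them duplicates.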

We used:

247138 questions from the Quora dataset
91983 texts from the SQuAD 2.0 dataset
80076 texts from the CNN news dataset

Structure of the dataset

text - an original sentence or question from the source datasets
paraphrases - a list of 5 paraphrases
category - question / sentence
source - quora / squad_2 / cnn_news
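The column layout above could be modelled as follows. This is only a sketch: the `Row` type and `is_valid_row` helper are hypothetical names, and the sample row is invented for illustration rather than taken from the dataset. How the columns are serialized in the published files is not stated here, so the sketch assumes `paraphrases` has already been loaded as a Python list.

```python
from typing import TypedDict


class Row(TypedDict):
    text: str               # original sentence or question
    paraphrases: list[str]  # the 5 generated paraphrases
    category: str           # "question" or "sentence"
    source: str             # "quora", "squad_2" or "cnn_news"


CATEGORIES = {"question", "sentence"}
SOURCES = {"quora", "squad_2", "cnn_news"}


def is_valid_row(row: dict) -> bool:
    """Check that a row matches the documented columns and value sets."""
    paraphrases = row.get("paraphrases")
    return (
        isinstance(row.get("text"), str)
        and isinstance(paraphrases, list)
        and len(paraphrases) == 5
        and all(isinstance(p, str) for p in paraphrases)
        and row.get("category") in CATEGORIES
        and row.get("source") in SOURCES
    )


# Invented example row, for illustration only.
example: Row = {
    "text": "How can I learn to cook?",
    "paraphrases": ["a", "b", "c", "d", "e"],
    "category": "question",
    "source": "quora",
}
```

A check like this can be useful before expanding rows into training pairs, since a row with fewer than 5 paraphrases would change the pair counts.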

Legal disclaimer

The data was generated with OpenAI's gpt-3.5-turbo, whose terms of use prohibit developing models that compete with OpenAI. So if you use this dataset to train a model, do not use that model to compete with OpenAI.

BibTeX entry and citation info

@inproceedings{chatgpt_paraphrases_dataset,
  author={Vladimir Vorobev and Maxim Kuznetsov},
  title={ChatGPT paraphrases dataset},
  year={2023}
}

Author:

humarin

Dataset size:

252.67 MB