Dataset: humarin/chatgpt-paraphrases

Task: text-to-text generation

Language: en

Size: 100K<n<1M

License: openrail

This is a dataset of paraphrases created by ChatGPT.

A model based on this dataset is available: model

We used this prompt to generate the paraphrases:

Generate 5 similar paraphrases for this question, show it like a numbered list without commentaries: {text}
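
As a minimal sketch of how the prompt above could be used, the snippet below fills the `{text}` placeholder and splits a numbered-list reply into separate paraphrases. The helper names `build_prompt` and `parse_numbered_list` are hypothetical, and the parsing rule (lines starting with `1.` or `1)`) is an assumption about the reply format, not the authors' documented pipeline.

```python
import re

# Prompt text copied from the dataset card; {text} is the only placeholder.
PROMPT_TEMPLATE = (
    "Generate 5 similar paraphrases for this question, "
    "show it like a numbered list without commentaries: {text}"
)


def build_prompt(text: str) -> str:
    """Insert the source sentence or question into the prompt (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_numbered_list(response: str) -> list[str]:
    """Extract items from lines such as '1. ...' or '2) ...' (assumed reply format)."""
    items = []
    for line in response.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            items.append(match.group(1))
    return items


if __name__ == "__main__":
    print(build_prompt("How do I learn Python?"))
    print(parse_numbered_list("1. First variant\n2. Second variant"))
```

Lines that do not start with a number are skipped, so stray commentary in a reply would be dropped rather than stored as a paraphrase.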

This dataset is based on the Quora paraphrase question dataset, texts from SQuAD 2.0, and the CNN news dataset.

We generated 5 paraphrases for each sample, so the dataset has about 420k rows in total. Each row contains 6 texts (the original plus its 5 paraphrases), and pairing each text with every other gives 6 x 5 = 30 ordered pairs per row. In this way you can build 12.6 million training pairs: 6 x 5 x 420,000 = 12.6 million directed pairs, or 6 x 5 x 420,000 / 2 = 6.3 million unique (unordered) pairs.
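
The pair counts can be checked with a short sketch. It assumes a row holds one original text and 5 distinct paraphrases. The function name `make_pairs` and the sample strings are hypothetical.

```python
from itertools import permutations


def make_pairs(text: str, paraphrases: list[str]) -> list[tuple[str, str]]:
    """Return every ordered (source, target) pair among the original and its paraphrases."""
    variants = [text, *paraphrases]          # 6 texts per row
    return list(permutations(variants, 2))   # 6 * 5 ordered pairs


# Hypothetical row, shaped like the dataset's text and paraphrases columns.
row = {"text": "q0", "paraphrases": ["p1", "p2", "p3", "p4", "p5"]}

pairs = make_pairs(row["text"], row["paraphrases"])
unique_pairs = {frozenset(pair) for pair in pairs}  # ignore direction

print(len(pairs))         # 30 ordered pairs per row
print(len(unique_pairs))  # 15 unordered pairs per row

approx_rows = 420_000
print(len(pairs) * approx_rows)         # 12600000
print(len(unique_pairs) * approx_rows)  # 6300000
```

Ordered pairs suit training in both directions (A to B and B to A). The unordered set is the better fit when direction does not matter, for example in similarity scoring.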

We used:

  • 247138 questions from the Quora dataset
  • 91983 texts from the SQuAD 2.0 dataset
  • 80076 texts from the CNN news dataset

Structure of the dataset

  • text column - an original sentence or question from the datasets
  • paraphrases - a list of 5 paraphrases
  • category - question / sentence
  • source - quora / squad_2 / cnn_news
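
The column layout above can be expressed as a typed record. This is a sketch only: the `Row` type and `normalize_row` helper are hypothetical, and the string-decoding branch assumes that some export formats (such as CSV) may serialize the `paraphrases` list as text, which the card does not state.

```python
import ast
from typing import TypedDict


class Row(TypedDict):
    text: str               # original sentence or question
    paraphrases: list[str]  # 5 paraphrases of text
    category: str           # "question" or "sentence"
    source: str             # "quora", "squad_2", or "cnn_news"


def normalize_row(raw: dict) -> Row:
    """Coerce a raw record into the documented four-column shape (hypothetical helper)."""
    paraphrases = raw["paraphrases"]
    # Assumption: the list may arrive as its string representation.
    if isinstance(paraphrases, str):
        paraphrases = ast.literal_eval(paraphrases)
    return Row(
        text=raw["text"],
        paraphrases=list(paraphrases),
        category=raw["category"],
        source=raw["source"],
    )


# Hypothetical record for illustration; not taken from the dataset.
example = normalize_row({
    "text": "What is the capital of France?",
    "paraphrases": "['a', 'b', 'c', 'd', 'e']",
    "category": "question",
    "source": "quora",
})
print(example["paraphrases"])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer way to decode a stringified list than `eval`.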

Legal disclaimer

The data was generated with OpenAI's gpt-3.5-turbo, whose terms of use prohibit developing models that compete with OpenAI. If you use this dataset to train a model, do not use that model to compete with OpenAI.

BibTeX entry and citation info

@inproceedings{chatgpt_paraphrases_dataset,
  author={Vladimir Vorobev and Maxim Kuznetsov},
  title={ChatGPT paraphrases dataset},
  year={2023}
}