A dataset for training models to classify human-written vs. GPT/ChatGPT-generated text. It contains Wikipedia introductions and GPT (Curie)-generated introductions for 150k topics.
Prompt used for generating text:

```
200 word wikipedia style introduction on '{title}' {starter_text}
```
where `title` is the title of the Wikipedia page, and `starter_text` is the first seven words of the Wikipedia introduction. Here's an example of the prompt used to generate the introduction paragraph for 'Secretory protein':

```
200 word wikipedia style introduction on Secretory protein
A secretory protein is any protein, whether
```
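As a sketch, the per-topic prompt can be assembled from the page title and the first seven words of the Wikipedia introduction. The helper name below is illustrative and not from the dataset's published code:

```python
def build_prompt(title: str, wiki_intro: str) -> str:
    # starter_text: the first seven whitespace-delimited words of the
    # Wikipedia introduction (an assumption about how "words" are split).
    starter_text = " ".join(wiki_intro.split()[:7])
    return f"200 word wikipedia style introduction on {title}\n{starter_text}"

prompt = build_prompt(
    "Secretory protein",
    "A secretory protein is any protein, whether it be endocrine or "
    "exocrine, which is secreted by a cell.",
)
print(prompt)
```

This reproduces the 'Secretory protein' example above: the first line is the instruction with the title, the second line is the seven-word starter.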
Configuration used for the GPT model:

```
model="text-curie-001", prompt=prompt, temperature=0.7, max_tokens=300, top_p=1, frequency_penalty=0.4, presence_penalty=0.1
```
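These parameters map directly onto a request to the legacy OpenAI Completions endpoint. The call itself is commented out and is a sketch, not the dataset's actual generation script:

```python
# Generation parameters as listed above.
request_params = dict(
    model="text-curie-001",
    temperature=0.7,        # moderate sampling randomness
    max_tokens=300,         # cap on the length of the generated completion
    top_p=1,                # no nucleus-sampling truncation
    frequency_penalty=0.4,  # discourage repeating the same tokens
    presence_penalty=0.1,   # mildly encourage introducing new tokens
)

# Sketch of the API call (requires the `openai` package, an API key, and the
# legacy Completions endpoint; `prompt` is the string built for each topic):
# import openai
# response = openai.Completion.create(prompt=prompt, **request_params)
# generated_text = response["choices"][0]["text"]
```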
Schema for the dataset:

| Column | Datatype | Description |
|---|---|---|
| id | int64 | ID |
| url | string | Wikipedia URL |
| title | string | Title |
| wiki_intro | string | Introduction paragraph from Wikipedia |
| generated_intro | string | Introduction generated by the GPT (Curie) model |
| title_len | int64 | Number of words in title |
| wiki_intro_len | int64 | Number of words in wiki_intro |
| generated_intro_len | int64 | Number of words in generated_intro |
| prompt | string | Prompt used to generate the intro |
| generated_text | string | Text continued after the prompt |
| prompt_tokens | int64 | Number of tokens in the prompt |
| generated_text_tokens | int64 | Number of tokens in the generated text |
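The `*_len` columns can be reproduced if they are simple whitespace word counts — an assumption, since the counting code isn't shown here. A minimal sketch:

```python
def word_count(text: str) -> int:
    # Assumption: the *_len columns are whitespace-delimited word counts.
    return len(text.split())

row = {
    "title": "Secretory protein",
    "wiki_intro": "A secretory protein is any protein, whether it be "
                  "endocrine or exocrine.",
}
row["title_len"] = word_count(row["title"])
row["wiki_intro_len"] = word_count(row["wiki_intro"])

# Loading the full dataset (requires the `datasets` package and network):
# from datasets import load_dataset
# ds = load_dataset("aadityaubhat/GPT-wiki-intro", split="train")
```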
The code used to create this dataset can be found on GitHub.
```
@misc{aaditya_bhat_2023,
  author    = {Aaditya Bhat},
  title     = {GPT-wiki-intro (Revision 0e458f5)},
  year      = 2023,
  url       = {https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro},
  doi       = {10.57967/hf/0326},
  publisher = {Hugging Face}
}
```