A dataset for training models to classify human-written vs. GPT/ChatGPT-generated text. It contains Wikipedia introductions and GPT (Curie)-generated introductions for 150k topics.
Prompt used for generating text:

```
200 word wikipedia style introduction on '{title}' {starter_text}
```
where `title` is the title of the Wikipedia page, and `starter_text` is the first seven words of the Wikipedia introduction. Here's an example of the prompt used to generate the introduction paragraph for 'Secretory protein':

```
200 word wikipedia style introduction on Secretory protein
A secretory protein is any protein, whether
```
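As a sketch, the per-topic prompt can be assembled from the page title and the first seven words of the Wikipedia introduction. The helper name below is illustrative and not from the dataset's published code:

```python
def build_prompt(title: str, wiki_intro: str) -> str:
    # starter_text: the first seven whitespace-delimited words of the
    # Wikipedia introduction (an assumption about how "words" are split).
    starter_text = " ".join(wiki_intro.split()[:7])
    return f"200 word wikipedia style introduction on {title}\n{starter_text}"

prompt = build_prompt(
    "Secretory protein",
    "A secretory protein is any protein, whether it be endocrine or "
    "exocrine, which is secreted by a cell.",
)
print(prompt)
```

This reproduces the 'Secretory protein' example above: the first line is the instruction with the title, the second line is the seven-word starter.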
Configuration used for the GPT model:

```
model="text-curie-001", prompt=prompt, temperature=0.7, max_tokens=300, top_p=1, frequency_penalty=0.4, presence_penalty=0.1
```
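These parameters map directly onto a request to the legacy OpenAI Completions endpoint. The call itself is commented out and is a sketch, not the dataset's actual generation script:

```python
# Generation parameters as listed above.
request_params = dict(
    model="text-curie-001",
    temperature=0.7,        # moderate sampling randomness
    max_tokens=300,         # cap on the length of the generated completion
    top_p=1,                # no nucleus-sampling truncation
    frequency_penalty=0.4,  # discourage repeating the same tokens
    presence_penalty=0.1,   # mildly encourage introducing new tokens
)

# Sketch of the API call (requires the `openai` package, an API key, and the
# legacy Completions endpoint; `prompt` is the string built for each topic):
# import openai
# response = openai.Completion.create(prompt=prompt, **request_params)
# generated_text = response["choices"][0]["text"]
```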
Schema for the dataset:

| Column | Datatype | Description |
|---|---|---|
| id | int64 | ID |
| url | string | Wikipedia URL |
| title | string | Title |
| wiki_intro | string | Introduction paragraph from Wikipedia |
| generated_intro | string | Introduction generated by the GPT (Curie) model |
| title_len | int64 | Number of words in title |
| wiki_intro_len | int64 | Number of words in wiki_intro |
| generated_intro_len | int64 | Number of words in generated_intro |
| prompt | string | Prompt used to generate the intro |
| generated_text | string | Text continued after the prompt |
| prompt_tokens | int64 | Number of tokens in the prompt |
| generated_text_tokens | int64 | Number of tokens in the generated text |
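The `*_len` columns can be reproduced if they are simple whitespace word counts — an assumption, since the counting code isn't shown here. A minimal sketch:

```python
def word_count(text: str) -> int:
    # Assumption: the *_len columns are whitespace-delimited word counts.
    return len(text.split())

row = {
    "title": "Secretory protein",
    "wiki_intro": "A secretory protein is any protein, whether it be "
                  "endocrine or exocrine.",
}
row["title_len"] = word_count(row["title"])
row["wiki_intro_len"] = word_count(row["wiki_intro"])

# Loading the full dataset (requires the `datasets` package and network):
# from datasets import load_dataset
# ds = load_dataset("aadityaubhat/GPT-wiki-intro", split="train")
```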
The code used to create this dataset can be found on GitHub.
```
@misc{aaditya_bhat_2023,
  author    = {Aaditya Bhat},
  title     = {GPT-wiki-intro (Revision 0e458f5)},
  year      = 2023,
  url       = {https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro},
  doi       = {10.57967/hf/0326},
  publisher = {Hugging Face}
}
```