Dataset:
openwebtext
An open-source replication of OpenAI's WebText dataset, which was used to train GPT-2.
This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.
An example of 'train' looks as follows.
This example was too long and was cropped: { "text": "\"A magazine supplement with an image of Adolf Hitler and the title 'The Unreadable Book' is pictured in Berlin. No law bans “Mei..." }
The data fields are the same among all splits.
name | train |
---|---|
plain_text | 8013769 |
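As a rough illustration of the record layout above, the following sketch shows one way to preview a record with a single `text` field. The helper names `crop_text` and `stream_openwebtext` are hypothetical, and the loader assumes the Hugging Face `datasets` package is installed and that the dataset is reachable under the identifier `openwebtext` with streaming enabled; none of this is confirmed by the card itself.

```python
from typing import Dict, Iterator


def crop_text(record: Dict[str, str], max_chars: int = 120) -> Dict[str, str]:
    """Return a copy of a record whose "text" field is shortened for display."""
    text = record["text"]
    if len(text) > max_chars:
        text = text[:max_chars] + "..."
    return {"text": text}


def stream_openwebtext() -> Iterator[Dict[str, str]]:
    """Lazily iterate over the 'train' split instead of downloading it up front.

    Assumption: the `datasets` package is installed and the dataset is
    available under the identifier "openwebtext".
    """
    # Imported here so that crop_text can be used without `datasets` installed.
    from datasets import load_dataset

    return iter(load_dataset("openwebtext", split="train", streaming=True))
```

With network access, `crop_text(next(stream_openwebtext()))` would print a shortened version of the first record; streaming is used here only to avoid fetching the full corpus.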
The authors started by extracting all Reddit post URLs from the Reddit submissions dataset. These links were deduplicated, filtered to exclude non-HTML content, and then shuffled randomly. The links were then distributed to several machines in parallel for download, and all web pages were extracted using the newspaper Python package. Non-English web pages were filtered out using Facebook's fastText.
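The URL preparation steps described above (deduplicate, drop non-HTML links, shuffle, distribute) can be sketched as follows. This is a minimal illustration, not the authors' code: the extension list, the function names `prepare_urls` and `shard`, the fixed seed, and the round-robin split are all assumptions made for the example.

```python
import random
from typing import Iterable, List
from urllib.parse import urlparse

# Hypothetical list of file extensions treated as non-HTML content.
NON_HTML_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".mp4", ".zip")


def prepare_urls(urls: Iterable[str], seed: int = 0) -> List[str]:
    """Deduplicate URLs, drop obvious non-HTML links, and shuffle the rest."""
    unique = list(dict.fromkeys(urls))  # removes repeats, keeps first-seen order
    html_like = [
        u for u in unique
        if not urlparse(u).path.lower().endswith(NON_HTML_EXTENSIONS)
    ]
    random.Random(seed).shuffle(html_like)  # seeded so the order is reproducible
    return html_like


def shard(urls: List[str], num_workers: int) -> List[List[str]]:
    """Split the URL list round-robin into one chunk per download worker."""
    return [urls[i::num_workers] for i in range(num_workers)]
```

Filtering by file extension is only a cheap first pass; a real pipeline would more likely also check the `Content-Type` of each response at download time.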
Subsequently, near-duplicate documents were identified using locality-sensitive hashing (LSH). Documents were hashed into sets of 5-grams, and all documents whose similarity exceeded a threshold of 0.5 were removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38GB of text data (40GB using SI units) from 8,013,769 documents.
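To make the filtering criteria above concrete, here is a small sketch that computes exact Jaccard similarity over 5-gram sets, which is the quantity that LSH schemes such as MinHash approximate at scale. It is not the authors' implementation: it compares documents pairwise rather than hashing them, it assumes word-level 5-grams and whitespace tokenization, and it assumes the first document in each near-duplicate group is kept. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 5) -> Set[Shingle]:
    """Return the set of word n-grams (5-grams by default) of a document."""
    words = text.split()  # assumption: whitespace tokenization
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_corpus(
    docs: List[str], threshold: float = 0.5, min_tokens: int = 128
) -> List[str]:
    """Drop near-duplicates first, then drop documents that are too short."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        s = shingles(doc)
        # Skip the document if it is too similar to one that was already kept.
        if any(jaccard(s, other) > threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(s)
    return [d for d in kept if len(d.split()) >= min_tokens]
```

The pairwise comparison is quadratic in the number of documents, so it only suits small samples; for a corpus of millions of pages, hashing is what makes the same check tractable.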
The dataset doesn't contain annotations.
These data are released under the following licensing scheme from the original authors (source):
We do not own any of the text from which these data has been extracted. We license the actual packaging of these parallel data under the [Creative Commons CC0 license (“no rights reserved”)](https://creativecommons.org/share-your-work/public-domain/cc0/)
Notice policy
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
Clearly identify the copyrighted work claimed to be infringed.
Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
And contact us at the following email address: openwebtext at gmail.com and datasets at huggingface.co
Take down policy
The original authors will comply with legitimate requests by removing the affected sources from the next release of the corpus. Hugging Face will also update this repository accordingly.
@misc{Gokaslan2019OpenWeb,
  title={OpenWebText Corpus},
  author={Aaron Gokaslan*, Vanya Cohen*, Ellie Pavlick, Stefanie Tellex},
  howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
  year={2019}
}
Thanks to @richarddwang for adding this dataset.