Dataset:
olm/wikipedia
This repo is a fork of the original Hugging Face Wikipedia repo. The difference is that this fork removes the need for apache-beam and is very fast on machines with many CPUs: it uses all available CPUs to create a clean Wikipedia pretraining dataset. Processing all of English Wikipedia takes less than an hour on a GCP n1-standard-96. This fork is also used in the OLM Project to pull and process up-to-date Wikipedia snapshots.
Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).
The articles are parsed using the mwparserfromhell tool, and we use multiprocess for parallelization.
To load this dataset, first install these dependencies:
pip install mwparserfromhell==0.6.4 multiprocess==0.70.13
Then, you can load any subset of Wikipedia per language and per date this way:
from datasets import load_dataset
load_dataset("olm/wikipedia", language="en", date="20220920")
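As a minimal sketch, the language and date arguments could be checked before they are handed to load_dataset, so that a malformed date is caught early. The helper name wikipedia_load_kwargs is hypothetical and not part of this repo, and it assumes dates are always written in the YYYYMMDD form shown above.

```python
from datetime import datetime


def wikipedia_load_kwargs(language: str, date: str) -> dict:
    """Build keyword arguments for load_dataset (hypothetical helper).

    Assumes dump dates are written as YYYYMMDD, e.g. "20220920".
    """
    if not language:
        raise ValueError("language must be a non-empty code such as 'en'")
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"date must be 8 digits (YYYYMMDD), got {date!r}")
    # strptime rejects impossible calendar dates such as 20220231.
    datetime.strptime(date, "%Y%m%d")
    return {"path": "olm/wikipedia", "language": language, "date": date}


kwargs = wikipedia_load_kwargs("en", "20220920")
print(kwargs)
```

The result can then be passed on as load_dataset(**kwargs). This only checks the format of the arguments, not whether a dump for that language and date actually exists.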
The available languages and dates are those of the Wikimedia dumps (https://dumps.wikimedia.org/).
The dataset is generally used for Language Modeling.
You can find the list of languages here.
An example looks as follows:
{'id': '1', 'url': 'https://simple.wikipedia.org/wiki/April', 'title': 'April', 'text': 'April is the fourth month...' }
The data fields are the same among all configurations: id, url, title, and text, all strings, as in the example above.
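The fields visible in the example record (id, url, title, text) can be checked with a small validation function before records are fed into a training pipeline. This is only a sketch: check_record and EXPECTED_FIELDS are hypothetical names, and the assumption that every field is a string is based on the example record alone.

```python
# Field names taken from the example record; assumed identical for all configurations.
EXPECTED_FIELDS = ("id", "url", "title", "text")


def check_record(record: dict) -> None:
    """Raise ValueError if a record lacks a field or holds a non-string value."""
    for field in EXPECTED_FIELDS:
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], str):
            raise ValueError(f"field {field!r} is not a string")


example = {
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month...",
}
check_record(example)  # passes silently for a well-formed record
```

Raising on the first problem keeps the check cheap enough to run on every record while iterating over the dataset.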
Most of Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts).
Some text has been imported only under CC BY-SA and CC BY-SA-compatible license and cannot be reused under GFDL; such text will be identified on the page footer, in the page history, or on the discussion page of the article that utilizes the text.
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}