Dataset Card for Wikipedia

This repo is a fork of the original Hugging Face Wikipedia repo. Unlike the original, this fork does not require apache-beam, and it is very fast on machines with many CPUs: it uses all available CPUs to create a clean Wikipedia pretraining dataset. Processing all of English Wikipedia takes less than an hour on a GCP n1-standard-96. This fork is also used in the OLM Project to pull and process up-to-date Wikipedia snapshots.

Dataset Summary

Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip wiki markup and unwanted sections (references, etc.).

The articles are parsed using the mwparserfromhell tool, and we use multiprocess for parallelization.

To load this dataset, first install these dependencies:

pip install mwparserfromhell==0.6.4 multiprocess==0.70.13

Then, you can load any subset of Wikipedia per language and per date this way:

from datasets import load_dataset

load_dataset("olm/wikipedia", language="en", date="20220920")

You can find the full list of languages and dates here.
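Before starting a long download, it can help to check that the `date` argument is well formed. The sketch below uses a hypothetical helper, `dump_index_url`, that is not part of this repo. It assumes that snapshot dates are written as `YYYYMMDD` and that dump directories follow the `https://dumps.wikimedia.org/<language>wiki/<date>/` layout, with hyphens in the language code replaced by underscores. It does not check that the snapshot actually exists on the server.

```python
from datetime import datetime


def dump_index_url(language: str, date: str) -> str:
    """Return the assumed dump directory URL for a language and snapshot date.

    Hypothetical helper for illustration; the URL layout is an assumption
    about dumps.wikimedia.org, not an API exposed by this dataset.
    """
    # The date must be exactly eight digits, e.g. "20220920".
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"date must look like YYYYMMDD, got {date!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    datetime.strptime(date, "%Y%m%d")
    # Assumption: wiki database names use underscores instead of hyphens.
    wiki_name = language.replace("-", "_") + "wiki"
    return f"https://dumps.wikimedia.org/{wiki_name}/{date}/"


print(dump_index_url("en", "20220920"))
```

Opening the returned URL in a browser is one way to confirm that a snapshot is still hosted before passing the same `language` and `date` values to `load_dataset`.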

Supported Tasks and Leaderboards

The dataset is generally used for language modeling.

Languages

You can find the list of languages here.

Dataset Structure

Data Instances

An example looks as follows:

{'id': '1',
 'url': 'https://simple.wikipedia.org/wiki/April',
 'title': 'April',
 'text': 'April is the fourth month...'
}

Data Fields

The data fields are the same among all configurations:

  • id (str): ID of the article.
  • url (str): URL of the article.
  • title (str): Title of the article.
  • text (str): Text content of the article.
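The field list above can be turned into a small sanity check to run on a few records before training. This is a minimal sketch: `validate_example` and `EXPECTED_FIELDS` are hypothetical names introduced here, and the sample record is the one shown under Data Instances.

```python
# Field names taken from the Data Fields list; all are expected to be strings.
EXPECTED_FIELDS = ("id", "url", "title", "text")


def validate_example(example: dict) -> None:
    """Raise if a record lacks one of the documented fields or a field is not a str.

    Hypothetical helper for illustration; it is not part of this dataset's code.
    """
    missing = [name for name in EXPECTED_FIELDS if name not in example]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    wrong_type = [name for name in EXPECTED_FIELDS
                  if not isinstance(example[name], str)]
    if wrong_type:
        raise TypeError(f"fields that are not str: {wrong_type}")


sample = {
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month...",
}
validate_example(sample)  # passes silently for a well-formed record
```

Note that `id` is a string, not an integer, so a record with `"id": 1` would be rejected by this check.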

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Most of Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

Some text has been imported only under CC BY-SA and CC BY-SA-compatible licenses and cannot be reused under GFDL; such text will be identified on the page footer, in the page history, or on the discussion page of the article that utilizes the text.

Citation Information

@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}