Dataset Card for Wikipedia

Dataset Summary

Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dumps (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).

The articles are parsed using the mwparserfromhell tool.

To load this dataset you need to install Apache Beam and mwparserfromhell first:

pip install apache_beam mwparserfromhell

Then, you can load any subset of Wikipedia per language and per date this way:

from datasets import load_dataset

load_dataset("wikipedia", language="sw", date="20220120", beam_runner=...)

where beam_runner can be any runner supported by Apache Beam for (distributed) data processing (see the list of supported runners in the Apache Beam documentation). Pass "DirectRunner" to run it on your machine.

You can find the full list of languages and dates here.
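
For example, a minimal end-to-end sketch using the values above (building large languages locally with "DirectRunner" can take a very long time, so prefer a distributed runner for those):

from datasets import load_dataset

# Build the Swahili subset from the 2022-01-20 dump on this machine.
# "DirectRunner" executes the Apache Beam pipeline locally.
wiki_sw = load_dataset("wikipedia", language="sw", date="20220120",
                       beam_runner="DirectRunner")

print(wiki_sw)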

Some subsets of Wikipedia have already been processed by Hugging Face, and you can load them directly with:

from datasets import load_dataset

load_dataset("wikipedia", "20220301.en")

The list of pre-processed subsets is:

  • "20220301.de"
  • "20220301.en"
  • "20220301.fr"
  • "20220301.frr"
  • "20220301.it"
  • "20220301.simple"

Supported Tasks and Leaderboards

The dataset is generally used for Language Modeling.

Languages

You can find the list of languages here.

Dataset Structure

Data Instances

An example looks as follows:

{'id': '1',
 'url': 'https://simple.wikipedia.org/wiki/April',
 'title': 'April',
 'text': 'April is the fourth month...'
}
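
A short sketch showing how such an instance is accessed once a subset is loaded (the field names are exactly those in the example above):

from datasets import load_dataset

# Load only the train split of the "simple" English subset.
wiki = load_dataset("wikipedia", "20220301.simple", split="train")

example = wiki[0]  # each example is a plain Python dict
print(example["id"], example["url"], example["title"])
print(example["text"][:200])  # first 200 characters of the article body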

Some subsets of Wikipedia have already been processed by Hugging Face; their sizes are listed below:

name             downloaded dataset files   generated dataset   total disk used
20220301.de      6.84 GB                    9.34 GB             16.18 GB
20220301.en      21.60 GB                   21.26 GB            42.86 GB
20220301.fr      5.87 GB                    7.73 GB             13.61 GB
20220301.frr     13.04 MB                   9.57 MB             22.62 MB
20220301.it      3.69 GB                    4.76 GB             8.45 GB
20220301.simple  251.32 MB                  246.49 MB           497.82 MB
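
If you want these sizes programmatically without downloading the data, a sketch using datasets.load_dataset_builder (download_size and dataset_size are reported in bytes):

from datasets import load_dataset_builder

# Fetch only the metadata for a configuration, not the data itself.
builder = load_dataset_builder("wikipedia", "20220301.en")

print(builder.info.download_size)  # size of the downloaded dataset files
print(builder.info.dataset_size)   # size of the generated dataset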

Data Fields

The data fields are the same among all configurations:

  • id (str): ID of the article.
  • url (str): URL of the article.
  • title (str): Title of the article.
  • text (str): Text content of the article.
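
These fields can be checked against the dataset's schema; a minimal sketch:

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.simple", split="train")

# All four fields are plain strings in the schema.
print(wiki.features)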

Data Splits

The number of examples in the train split of each pre-processed configuration is:

name             train
20220301.de      2665357
20220301.en      6458670
20220301.fr      2402095
20220301.frr     15199
20220301.it      1743035
20220301.simple  205328
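
Each configuration exposes a single train split, so these counts correspond to the number of rows of that split; for example:

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.simple")

# Every Wikipedia configuration has only a "train" split.
print(wiki["train"].num_rows)  # 205328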

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Most of Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

Some text has been imported only under CC BY-SA and CC BY-SA-compatible licenses and cannot be reused under the GFDL; such text is identified on the page footer, in the page history, or on the discussion page of the article that uses it.

Citation Information

@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}

Contributions

Thanks to @lewtun, @mariamabarham, @thomwolf, @lhoestq, and @patrickvonplaten for adding this dataset.