Dataset:
wikipedia
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump ( https://dumps.wikimedia.org/ ), with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.).
The articles are parsed using the mwparserfromhell tool.
To load this dataset you need to install Apache Beam and mwparserfromhell first:
pip install apache_beam mwparserfromhell
Then, you can load any subset of Wikipedia per language and per date this way:
from datasets import load_dataset load_dataset("wikipedia", language="sw", date="20220120", beam_runner=...)
where beam_runner can be any Apache Beam supported runner for (distributed) data processing (see the Apache Beam documentation for the available runners). Pass "DirectRunner" to run it on your machine.
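For example, running entirely on the local machine with the language and date from the snippet above (Beam's DirectRunner processes everything in-process, which can take a long time for large Wikipedias):

from datasets import load_dataset

# Build the Swahili subset from the 20220120 dump locally with DirectRunner.
wiki = load_dataset("wikipedia", language="sw", date="20220120", beam_runner="DirectRunner")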
You can find the full list of languages and dates on the Wikimedia dumps site ( https://dumps.wikimedia.org/ ).
Some subsets of Wikipedia have already been processed by HuggingFace, and you can load them just with:
from datasets import load_dataset load_dataset("wikipedia", "20220301.en")
The list of pre-processed subsets is: 20220301.de, 20220301.en, 20220301.fr, 20220301.frr, 20220301.it, 20220301.simple.
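A pre-processed configuration behaves like any other datasets configuration, so you can pick a split and inspect the result directly; a minimal sketch using the simple subset:

from datasets import load_dataset

# Load only the train split of a pre-processed configuration.
wiki = load_dataset("wikipedia", "20220301.simple", split="train")
print(wiki)  # shows the number of rows and the column names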
The dataset is generally used for Language Modeling.
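As a sketch of that use, the text field can be tokenized into language-model training inputs; the gpt2 tokenizer below is an arbitrary choice for illustration, not something the dataset prescribes:

from datasets import load_dataset
from transformers import AutoTokenizer

wiki = load_dataset("wikipedia", "20220301.simple", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # arbitrary tokenizer, for illustration

# Tokenize each article's text (truncating to the model's max length)
# and drop the original columns.
tokenized = wiki.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=wiki.column_names,
)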
You can find the list of languages on Meta-Wiki's List of Wikipedias ( https://meta.wikimedia.org/wiki/List_of_Wikipedias ).
An example looks as follows:
{'id': '1',
 'url': 'https://simple.wikipedia.org/wiki/April',
 'title': 'April',
 'text': 'April is the fourth month...'}
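The fields of an example can be accessed directly after loading; a minimal sketch, again on the simple configuration:

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.simple", split="train")
article = wiki[0]
print(article["title"])       # an article title, e.g. 'April'
print(article["text"][:200])  # first 200 characters of the article body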
The data fields are the same among all configurations:
id (str): ID of the article.
url (str): URL of the article.
title (str): Title of the article.
text (str): Text content of the article.
Here is the number of training examples in each pre-processed configuration:
name | train |
---|---|
20220301.de | 2665357 |
20220301.en | 6458670 |
20220301.fr | 2402095 |
20220301.frr | 15199 |
20220301.it | 1743035 |
20220301.simple | 205328 |
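These counts can also be read programmatically from the dataset metadata without downloading the data (split information may only be populated once the metadata is available for the configuration):

from datasets import load_dataset_builder

# Read split sizes from the dataset metadata.
builder = load_dataset_builder("wikipedia", "20220301.simple")
print(builder.info.splits["train"].num_examples)  # 205328, matching the table above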
Most of Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts).
Some text has been imported only under CC BY-SA and CC BY-SA-compatible licenses and cannot be reused under the GFDL; such text is identified on the page footer, in the page history, or on the discussion page of the article that uses it.
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}
Thanks to @lewtun, @mariamabarham, @thomwolf, @lhoestq, and @patrickvonplaten for adding this dataset.