Dataset Card for Wikipedia

Dataset Summary

Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dumps (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).

The articles are parsed using the mwparserfromhell tool.

To load this dataset you need to install Apache Beam and mwparserfromhell first:

pip install apache_beam mwparserfromhell

Then, you can load any subset of Wikipedia per language and per date this way:

from datasets import load_dataset

load_dataset("wikipedia", language="sw", date="20220120", beam_runner=...)

where beam_runner can be any runner supported by Apache Beam for (distributed) data processing (see the list of supported runners in the Apache Beam documentation). Pass "DirectRunner" to run it on your machine.

You can find the full list of languages and dates here.
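
For example, a minimal end-to-end sketch using the values above (building large languages locally with "DirectRunner" can take a very long time, so prefer a distributed runner for those):

from datasets import load_dataset

# Build the Swahili subset from the 2022-01-20 dump on this machine.
# "DirectRunner" executes the Apache Beam pipeline locally.
wiki_sw = load_dataset("wikipedia", language="sw", date="20220120",
                       beam_runner="DirectRunner")

print(wiki_sw)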

Some subsets of Wikipedia have already been processed by Hugging Face, and you can load them directly with:

from datasets import load_dataset

load_dataset("wikipedia", "20220301.en")

The list of pre-processed subsets is:

  • "20220301.de"
  • "20220301.en"
  • "20220301.fr"
  • "20220301.frr"
  • "20220301.it"
  • "20220301.simple"

Supported Tasks and Leaderboards

The dataset is generally used for Language Modeling.

Languages

You can find the list of languages here.

Dataset Structure

Data Instances

An example looks as follows:

{'id': '1',
 'url': 'https://simple.wikipedia.org/wiki/April',
 'title': 'April',
 'text': 'April is the fourth month...'
}
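
A short sketch showing how such an instance is accessed once a subset is loaded (the field names are exactly those in the example above):

from datasets import load_dataset

# Load only the train split of the "simple" English subset.
wiki = load_dataset("wikipedia", "20220301.simple", split="train")

example = wiki[0]  # each example is a plain Python dict
print(example["id"], example["url"], example["title"])
print(example["text"][:200])  # first 200 characters of the article body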

Some subsets of Wikipedia have already been processed by Hugging Face; their sizes are listed below:

name             downloaded dataset files   generated dataset   total disk used
20220301.de      6.84 GB                    9.34 GB             16.18 GB
20220301.en      21.60 GB                   21.26 GB            42.86 GB
20220301.fr      5.87 GB                    7.73 GB             13.61 GB
20220301.frr     13.04 MB                   9.57 MB             22.62 MB
20220301.it      3.69 GB                    4.76 GB             8.45 GB
20220301.simple  251.32 MB                  246.49 MB           497.82 MB
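
If you want these sizes programmatically without downloading the data, a sketch using datasets.load_dataset_builder (download_size and dataset_size are reported in bytes):

from datasets import load_dataset_builder

# Fetch only the metadata for a configuration, not the data itself.
builder = load_dataset_builder("wikipedia", "20220301.en")

print(builder.info.download_size)  # size of the downloaded dataset files
print(builder.info.dataset_size)   # size of the generated dataset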

Data Fields

The data fields are the same among all configurations:

  • id (str): ID of the article.
  • url (str): URL of the article.
  • title (str): Title of the article.
  • text (str): Text content of the article.
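
These fields can be checked against the dataset's schema; a minimal sketch:

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.simple", split="train")

# All four fields are plain strings in the schema.
print(wiki.features)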

Data Splits

The number of examples in the train split of each pre-processed configuration is:

name             train
20220301.de      2665357
20220301.en      6458670
20220301.fr      2402095
20220301.frr     15199
20220301.it      1743035
20220301.simple  205328
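
Each configuration exposes a single train split, so these counts correspond to the number of rows of that split; for example:

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.simple")

# Every Wikipedia configuration has only a "train" split.
print(wiki["train"].num_rows)  # 205328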

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Most of Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

Some text has been imported only under CC BY-SA and CC BY-SA-compatible licenses and cannot be reused under the GFDL; such text is identified on the page footer, in the page history, or on the discussion page of the article that uses it.

Citation Information

@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}

Contributions

Thanks to @lewtun, @mariamabarham, @thomwolf, @lhoestq, and @patrickvonplaten for adding this dataset.