Dataset:
joelito/EU_Wikipedias
Wikipedia dataset containing cleaned articles in the 24 official EU languages. The dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/), with one split per language. Each example contains the full text of one Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).
The dataset supports the fill-mask task.
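For example, articles from this dataset can serve as input to a masked language model. A minimal fill-mask sketch, assuming the `transformers` library and the `xlm-roberta-base` checkpoint (both illustrative choices, not part of this card):

```python
from transformers import pipeline

# Any multilingual masked-LM checkpoint would do; xlm-roberta-base is
# used here purely for illustration. Its mask token is "<mask>".
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")
print(fill_mask("Berlin ist die Hauptstadt von <mask>."))
```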
The following languages are supported: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
It is structured in the following format: `{date}/{language}_{shard}.jsonl.xz`. At the moment, only the date '20221120' is supported.
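Since each shard is a plain jsonl.xz file, a downloaded shard can also be read without the `datasets` library. A minimal sketch, assuming a hypothetical local shard path that follows the layout above:

```python
import json
import lzma

# Hypothetical local path following {date}/{language}_{shard}.jsonl.xz.
with lzma.open("20221120/mt_0.jsonl.xz", mode="rt", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)  # one cleaned Wikipedia article per line
        print(article)
        break  # inspect just the first article
```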
Use the dataset like this:
```python
from datasets import load_dataset

dataset = load_dataset('joelito/EU_Wikipedias', date="20221120", language="de", split='train', streaming=True)
```
The file format is jsonl.xz and there is one split available (`train`).
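Because the split is loaded with `streaming=True` above, examples are yielded lazily rather than downloaded in full; for instance:

```python
# Inspect the first streamed example without downloading all shards.
first_article = next(iter(dataset))
print(first_article.keys())
```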
Source | Size (MB) | Words | Documents | Words/Document |
---|---|---|---|---|
20221120.all | 86034 | 9506846949 | 26481379 | 359 |
20221120.bg | 1261 | 88138772 | 285876 | 308 |
20221120.cs | 1904 | 189580185 | 513851 | 368 |
20221120.da | 679 | 74546410 | 286864 | 259 |
20221120.de | 11761 | 1191919523 | 2740891 | 434 |
20221120.el | 1531 | 103504078 | 215046 | 481 |
20221120.en | 26685 | 3192209334 | 6575634 | 485 |
20221120.es | 6636 | 801322400 | 1583597 | 506 |
20221120.et | 538 | 48618507 | 231609 | 209 |
20221120.fi | 1391 | 115779646 | 542134 | 213 |
20221120.fr | 9703 | 1140823165 | 2472002 | 461 |
20221120.ga | 72 | 8025297 | 57808 | 138 |
20221120.hr | 555 | 58853753 | 198746 | 296 |
20221120.hu | 1855 | 167732810 | 515777 | 325 |
20221120.it | 5999 | 687745355 | 1782242 | 385 |
20221120.lt | 409 | 37572513 | 203233 | 184 |
20221120.lv | 269 | 25091547 | 116740 | 214 |
20221120.mt | 29 | 2867779 | 5030 | 570 |
20221120.nl | 3208 | 355031186 | 2107071 | 168 |
20221120.pl | 3608 | 349900622 | 1543442 | 226 |
20221120.pt | 3315 | 389786026 | 1095808 | 355 |
20221120.ro | 1017 | 111455336 | 434935 | 256 |
20221120.sk | 506 | 49612232 | 238439 | 208 |
20221120.sl | 543 | 58858041 | 178472 | 329 |
20221120.sv | 2560 | 257872432 | 2556132 | 100 |
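The Words/Document column is simply Words divided by Documents, rounded to an integer; for example, for the aggregate 20221120.all row:

```python
# Reproduce Words/Document for the 20221120.all row of the table above.
words, documents = 9_506_846_949, 26_481_379
print(round(words / documents))  # 359
```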
This dataset was created by downloading the Wikipedias for the 24 EU languages using the olm/wikipedia loading script. For more information about the creation of the dataset, please refer to prepare_wikipedias.py.
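A hedged sketch of what this process looks like, assuming olm/wikipedia is invoked once per language (an illustration, not the actual prepare_wikipedias.py):

```python
from datasets import load_dataset

# The 24 EU languages listed above.
EU_LANGUAGES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

for lang in EU_LANGUAGES:
    # olm/wikipedia downloads and cleans the dump for one language.
    wiki = load_dataset("olm/wikipedia", language=lang, date="20221120")
    # ... shard and write to {date}/{lang}_{shard}.jsonl.xz (omitted here)
```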
Who are the source language producers? The text was produced by the volunteer contributors who write and edit the respective language Wikipedias.
Who are the annotators? The dataset contains only cleaned article text and no additional annotations, so no annotators were involved.
TODO add citation
Thanks to @JoelNiklaus for adding this dataset.