Dataset:
joelito/EU_Wikipedias
Wikipedia dataset containing cleaned articles in the 24 official EU languages. The dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/), with one split per language. Each example contains the full text of one Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).
The dataset supports the fill-mask task.
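For example, articles from this dataset can serve as input to a masked language model. A minimal fill-mask sketch, assuming the `transformers` library and the `xlm-roberta-base` checkpoint (both illustrative choices, not part of this card):

```python
from transformers import pipeline

# Any multilingual masked-LM checkpoint would do; xlm-roberta-base is
# used here purely for illustration. Its mask token is "<mask>".
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")
print(fill_mask("Berlin ist die Hauptstadt von <mask>."))
```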
The following languages are supported: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
It is structured in the following format: `{date}/{language}_{shard}.jsonl.xz`. At the moment, only the date '20221120' is supported.
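Since each shard is a plain jsonl.xz file, a downloaded shard can also be read without the `datasets` library. A minimal sketch, assuming a hypothetical local shard path that follows the layout above:

```python
import json
import lzma

# Hypothetical local path following {date}/{language}_{shard}.jsonl.xz.
with lzma.open("20221120/mt_0.jsonl.xz", mode="rt", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)  # one cleaned Wikipedia article per line
        print(article)
        break  # inspect just the first article
```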
Use the dataset like this:
```python
from datasets import load_dataset

dataset = load_dataset('joelito/EU_Wikipedias', date="20221120", language="de", split='train', streaming=True)
```
The file format is jsonl.xz and there is one split available (`train`).
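Because the split is loaded with `streaming=True` above, examples are yielded lazily rather than downloaded in full; for instance:

```python
# Inspect the first streamed example without downloading all shards.
first_article = next(iter(dataset))
print(first_article.keys())
```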
Source | Size (MB) | Words | Documents | Words/Document |
---|---|---|---|---|
20221120.all | 86034 | 9506846949 | 26481379 | 359 |
20221120.bg | 1261 | 88138772 | 285876 | 308 |
20221120.cs | 1904 | 189580185 | 513851 | 368 |
20221120.da | 679 | 74546410 | 286864 | 259 |
20221120.de | 11761 | 1191919523 | 2740891 | 434 |
20221120.el | 1531 | 103504078 | 215046 | 481 |
20221120.en | 26685 | 3192209334 | 6575634 | 485 |
20221120.es | 6636 | 801322400 | 1583597 | 506 |
20221120.et | 538 | 48618507 | 231609 | 209 |
20221120.fi | 1391 | 115779646 | 542134 | 213 |
20221120.fr | 9703 | 1140823165 | 2472002 | 461 |
20221120.ga | 72 | 8025297 | 57808 | 138 |
20221120.hr | 555 | 58853753 | 198746 | 296 |
20221120.hu | 1855 | 167732810 | 515777 | 325 |
20221120.it | 5999 | 687745355 | 1782242 | 385 |
20221120.lt | 409 | 37572513 | 203233 | 184 |
20221120.lv | 269 | 25091547 | 116740 | 214 |
20221120.mt | 29 | 2867779 | 5030 | 570 |
20221120.nl | 3208 | 355031186 | 2107071 | 168 |
20221120.pl | 3608 | 349900622 | 1543442 | 226 |
20221120.pt | 3315 | 389786026 | 1095808 | 355 |
20221120.ro | 1017 | 111455336 | 434935 | 256 |
20221120.sk | 506 | 49612232 | 238439 | 208 |
20221120.sl | 543 | 58858041 | 178472 | 329 |
20221120.sv | 2560 | 257872432 | 2556132 | 100 |
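The Words/Document column is simply Words divided by Documents, rounded to an integer; for example, for the aggregate 20221120.all row:

```python
# Reproduce Words/Document for the 20221120.all row of the table above.
words, documents = 9_506_846_949, 26_481_379
print(round(words / documents))  # 359
```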
This dataset was created by downloading the Wikipedias for the 24 EU languages using the olm/wikipedia loading script. For more information about the creation of the dataset, please refer to prepare_wikipedias.py.
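A hedged sketch of what this process looks like, assuming olm/wikipedia is invoked once per language (an illustration, not the actual prepare_wikipedias.py):

```python
from datasets import load_dataset

# The 24 EU languages listed above.
EU_LANGUAGES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

for lang in EU_LANGUAGES:
    # olm/wikipedia downloads and cleans the dump for one language.
    wiki = load_dataset("olm/wikipedia", language=lang, date="20221120")
    # ... shard and write to {date}/{lang}_{shard}.jsonl.xz (omitted here)
```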
Who are the source language producers? The text was produced by the volunteer contributors who write and edit the respective language Wikipedias.
Who are the annotators? The dataset contains only cleaned article text and no additional annotations, so no annotators were involved.
TODO add citation
Thanks to @JoelNiklaus for adding this dataset.