Dataset:
graelo/wikipedia
This Wikipedia dataset contains all available languages for recent dumps. It is a refresh of the 20220301 Wikipedia dataset from Hugging Face, so it has the same license and dataset card details. The benefits of this dataset are:
| version | dump | # available languages | closed & dump | closed & no dump |
|---|---|---|---|---|
| 1.0.0 | 20230601 | 328 | 9: ak (soon), cho, ho, ii, kj, lrc, mh, mus, ng | 4: aa, hz, kr, na |
| 1.1.0 | 20230601 | 329 (+et ~[az,ceb,ch,hr,ii,lrc,ta]) | 9: ak (soon), cho, ho, ii, kj, lrc, mh, mus, ng | 4: aa, hz, kr, na |
| | 20230901 | see you in September... | | |
Source: List of Wikimedia Languages. A few (9) Wikimedias are closed, meaning they won't have new pages, but the dumps are still available. In addition, very few (4) Wikimedias are closed and no longer have dumps at all.
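If you want to enumerate the available subsets programmatically, the 🤗 Datasets library exposes the configuration names directly. The only assumption in the sketch below is the "<date>.<language>" naming pattern, which matches the usage example further down.

```python
from datasets import get_dataset_config_names

# Every subset is named "<dump date>.<language code>", e.g. "20230601.es".
configs = get_dataset_config_names("graelo/wikipedia")

# Keep only the languages available for the 20230601 dump.
langs_20230601 = sorted(c.split(".", 1)[1] for c in configs if c.startswith("20230601."))
print(f"{len(langs_20230601)} languages available for the 20230601 dump")
```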
```python
from datasets import load_dataset

wikipedia_es = load_dataset("graelo/wikipedia", "20230601.es")
```
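For the largest languages (en, ceb), downloading every shard up front can take a while. As a sketch, the generic streaming mode of 🤗 Datasets should also work here: `streaming=True` and the train split are standard `load_dataset` arguments, and the article fields (`title`, `text`) are assumed to follow the schema of the 20220301 dataset that this one refreshes.

```python
from datasets import load_dataset

# Stream the English subset instead of downloading all shards first.
wikipedia_en = load_dataset(
    "graelo/wikipedia", "20230601.en", split="train", streaming=True
)

# Peek at the first article (field names assumed from the 20220301 schema).
first_article = next(iter(wikipedia_en))
print(first_article["title"])
print(first_article["text"][:200])
```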
Developer only. This dataset was preprocessed with a Beam DirectRunner as follows.
Choose one wikipedia dump, for instance https://dumps.wikimedia.org/cewiki/ and identify the date.
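To see which dates are available without opening the index in a browser, a small helper like the following can be used. This is only a sketch, assuming the dump index is a plain directory listing with one YYYYMMDD/ link per dump, which is what https://dumps.wikimedia.org serves.

```python
import re
from urllib.request import urlopen

# Hypothetical helper: list the dump dates published for one wiki (here cewiki).
index_html = urlopen("https://dumps.wikimedia.org/cewiki/").read().decode("utf-8")
dates = sorted(set(re.findall(r'href="(\d{8})/"', index_html)))
print(dates)  # e.g. [..., '20230601', ...]
```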
The next step, refreshing the language list, is optional: it is not very likely that a new language has suddenly appeared since the last version with a significant dataset.
Navigate to https://en.wikipedia.org/wiki/List_of_Wikipedias and copy the languages column from the "Detailed list" table (near the end of the page).
Copy that content, as a Python list, into lang_def.py (at the top of the repo) under a new date, along the lines of the sketch below.
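The exact layout of lang_def.py is not reproduced here; purely as a hypothetical illustration, the idea is a mapping from dump date to the list of language codes copied from the "Detailed list" table:

```python
# Hypothetical sketch of a lang_def.py entry -- the real file in the repo may
# be structured differently; only the idea (a dated list of language codes)
# is taken from the instructions above.
WIKIPEDIA_LANGUAGES = {
    "20230301": [
        # ...previous list, kept as-is...
    ],
    "20230601": [
        "en", "ceb", "de", "sv", "fr", "nl", "ru", "es", "it", "pl",
        # ...remaining codes copied from the "Detailed list" table...
    ],
}
```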
In order to properly extract links to images and media in all languages, we must refresh the two corresponding files. To do so, run the following from the root of the repo:
```bash
python -m prep.create_aliases
```
This will create or update two files at the root of the repo; they are used in the final step.
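The generated files themselves are not shown here. Purely as an illustration of why they are needed, the sketch below shows how per-language aliases of the media namespace make it possible to find image links in any language; MEDIA_ALIASES and find_media_links are hypothetical names, not the repo's actual code.

```python
import re

# The "File:" namespace is localized ("Datei:" in German, "Archivo:" in Spanish, ...),
# so media links can only be matched reliably if each language's aliases are known.
MEDIA_ALIASES = {  # hypothetical structure, for illustration only
    "en": ["File", "Image"],
    "de": ["Datei", "Bild"],
    "es": ["Archivo", "Imagen"],
}

def find_media_links(wikitext: str, lang: str) -> list[str]:
    aliases = "|".join(map(re.escape, MEDIA_ALIASES.get(lang, ["File"])))
    return re.findall(rf"\[\[(?:{aliases}):([^|\]]+)", wikitext)

print(find_media_links("Ein Bild: [[Datei:Beispiel.jpg|mini]]", "de"))  # ['Beispiel.jpg']
```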
Then run the build script below. It downloads the Wikipedia dumps for each language in lang_def.py and shards each language dataset into the appropriate number of shards (max size ~250 MB).
```bash
python -m prep.build --date 20230601
```
There are other options:
```text
$ python -m prep.build --help
usage: Wikipedia Builder [-h] [--date DATE] [--language [LANG ...]]
                         [--cache-dir DIR] [--mirror MIRROR]

Prepares the Wikipedia dataset for each language

optional arguments:
  -h, --help            show this help message and exit
  --date DATE           Wikipedia dump date (e.g. 20230601)
  --language [LANG ...] Language code (e.g. en). If missing, all languages are processed
  --cache-dir DIR       Cache directory for 🤗 Datasets
  --mirror MIRROR       Mirror URL
```
For instance, for faster downloads of the dumps, use the mirror option:
```bash
python -m prep.build \
  --date 20230601 \
  --language bs \
  --mirror https://mirror.accum.se/mirror/wikimedia.org/dumps/
```
It will download the dumps at around 60 MB/s instead of the capped speed (~4 MB/s) from https://dumps.wikimedia.org. The script will skip existing directories, allowing you to run it in several passes (see the sketch below).
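A minimal sketch of such a multi-pass run, assuming nothing beyond the CLI flags shown in the help output above (the language list and the wrapper itself are made up for the example):

```python
import subprocess

# Rebuild a handful of languages; prep.build skips languages whose output
# directory already exists, so this loop can simply be rerun for another pass.
LANGS = ["bs", "hr", "sr"]
MIRROR = "https://mirror.accum.se/mirror/wikimedia.org/dumps/"

for lang in LANGS:
    subprocess.run(
        ["python", "-m", "prep.build",
         "--date", "20230601",
         "--language", lang,
         "--mirror", MIRROR],
        check=False,  # keep going if one language fails; rerun later for another pass
    )
```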
Notes: