Wikipedia

This Wikipedia dataset contains all available languages for recent dumps. It is a refresh of the 20220301 Wikipedia dataset from Hugging Face, so it has the same license and dataset card details. The benefits of this dataset are:

  • more recent dumps (see table below)
  • a few additional languages
  • all available languages are preprocessed (including the largest: en and ceb)
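
| version | dump     | # available languages               | closed & dump                                  | closed & no dump  |
|---------|----------|-------------------------------------|------------------------------------------------|-------------------|
| 1.0.0   | 20230601 | 328                                 | 9: ak (soon), cho, ho, ii, kj, lrc, mh, mus, ng | 4: aa, hz, kr, na |
| 1.1.0   | 20230601 | 329 (+et ~[az,ceb,ch,hr,ii,lrc,ta]) | 9: ak (soon), cho, ho, ii, kj, lrc, mh, mus, ng | 4: aa, hz, kr, na |
|         | 20230901 | see you in September...             |                                                |                   |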

Source: List of Wikimedia Languages. A few (9) Wikimedias are closed, meaning they won't get new pages, but their dumps are still available. In addition, very few (4) Wikimedias are closed and no longer have dumps.

Release Notes

1.1.0

  • feat: Add missing Estonian (my bad), thanks Chris Ha
  • fix: Update category lists for az, ceb, ch, hr, ii, lrc, ta, which means they were all processed again.

1.0.0

  • chore: File layout is now data/{dump}/{lang}/{info.json,*.parquet}. Sorry for the radical update, probably won't happen again.
  • chore: Parquet files are now sharded (size < 200 MB), allowing parallel downloads and processing.
  • fix: All languages were processed again because of a bug in the media and category names that led to some links not being extracted.
  • feat: Add en and ceb, which were too big for my Beam DirectRunner at the time.
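
As a rough illustration of the data/{dump}/{lang}/ layout described in the 1.0.0 notes, the sketch below lists the Parquet shards for one language in a local copy of the repository. The helper name `list_shards` and the idea of a local `root` directory are assumptions made for this example; they are not part of the dataset tooling.

```python
from pathlib import Path


def list_shards(root: Path, dump: str, lang: str) -> list[Path]:
    """Return the Parquet shards under root/data/{dump}/{lang}/, sorted by name.

    Hypothetical helper: `root` is assumed to be a local copy of the repository.
    """
    lang_dir = root / "data" / dump / lang
    # info.json lives in the same directory, so only *.parquet files are matched.
    return sorted(lang_dir.glob("*.parquet"))
```

Because each shard is a separate file, a list like this can be handed to parallel workers, which is the benefit the sharding note mentions.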

Usage

from datasets import load_dataset

wikipedia_es = load_dataset("graelo/wikipedia", "20230601.es")
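
The configuration name in the call above appears to follow a `{dump}.{lang}` pattern. The sketch below builds that name and streams a single record instead of downloading every shard. The helper names `config_name` and `first_article` are hypothetical, and the `train` split is an assumption carried over from the original Hugging Face Wikipedia dataset.

```python
def config_name(dump: str, lang: str) -> str:
    """Build a configuration name such as "20230601.es" (hypothetical helper)."""
    return f"{dump}.{lang}"


def first_article(dump: str, lang: str) -> dict:
    """Stream the dataset and return its first record without a full download."""
    # Imported lazily so that config_name() stays usable without the library.
    from datasets import load_dataset

    # Assumption: the data is exposed as a single "train" split.
    stream = load_dataset(
        "graelo/wikipedia",
        config_name(dump, lang),
        split="train",
        streaming=True,
    )
    return next(iter(stream))
```

Calling `first_article("20230601", "es")` needs network access and the `datasets` package. Streaming is mainly useful for a quick look at large languages such as en.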

Build instructions

Developer only. This dataset was preprocessed with a Beam DirectRunner as follows.

1. Determine the date of the dump you are interested in

Choose one Wikipedia dump, for instance https://dumps.wikimedia.org/cewiki/, and identify the date.

2. [Optional] Get a refreshed list of languages

This is optional because it is not very likely that a new language has suddenly appeared since the last version and already has a significant dataset.

Navigate to https://en.wikipedia.org/wiki/List_of_Wikipedias and copy the languages column from the "Detailed list" table (near the end of the page).

Copy that content in the form of a Python list into lang_def.py (at the top of the repo) under a new date.

3. [Optional] Create Media and Category aliases

In order to properly extract links to images and media in all languages, we must refresh the two corresponding files. To do so, from the root of the repo, run

python -m prep.create_aliases

This will create or update these two files at the root of the repo:

  • media_aliases.py
  • category_aliases.py

These files are used in the final step.

4. Build and prepare the datasets into sharded parquet files

Running this script downloads the Wikipedia dumps for each language in lang_def.py and splits each language dataset into the appropriate number of shards (max size ~250 MB).

python -m prep.build --date 20230601
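
To give a feel for what "the appropriate number of shards" could mean, here is a minimal sketch of the arithmetic, assuming the count is simply the dataset size divided by the size cap, rounded up. The function name `shard_count` and the 250 MB default are illustrative assumptions; the actual script may compute this differently.

```python
def shard_count(total_bytes: int, max_shard_bytes: int = 250 * 1000 * 1000) -> int:
    """Smallest number of equal shards such that none exceeds max_shard_bytes.

    Hypothetical helper; the 250 MB default mirrors the approximate cap
    mentioned in the text and is an assumption.
    """
    if total_bytes <= 0:
        # An empty dataset is still written as one (empty) shard in this sketch.
        return 1
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (total_bytes + max_shard_bytes - 1) // max_shard_bytes
```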

There are other options:

$ python -m prep.build --help
usage: Wikipedia Builder [-h] [--date DATE] [--language [LANG ...]] [--cache-dir DIR] [--mirror MIRROR]

Prepares the Wikipedia dataset for each language

optional arguments:
  -h, --help             show this help message and exit
  --date DATE            Wikipedia dump date (e.g. 20230601)
  --language [LANG ...]  Language code (e.g. en). If missing, all languages are processed
  --cache-dir DIR        Cache directory for 🤗 Datasets
  --mirror MIRROR        Mirror URL

For instance, for faster downloads of the dumps, use the mirror option:

python -m prep.build \
    --date 20230601 \
    --language bs \
    --mirror https://mirror.accum.se/mirror/wikimedia.org/dumps/

It will download the dumps at around 60 MB/s instead of the capped speed (~4 MB/s) from https://dumps.wikimedia.org. The script skips existing directories, so you can run it in several passes.

Notes:

  • These instructions build upon the build process of the Wikipedia 🤗 Dataset. HF did a fantastic job, I just pushed it a bit further.
  • Be aware that not all mirrors contain all dumps. For instance, mirror.accum.se does not contain dumps for languages such as be-x-old or cbk-zam. My own solution is to run a first pass using the aforementioned mirror, and a second pass with the official https://dumps.wikimedia.org site (omitting the --mirror parameter).
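
The two-pass approach from the last note could be scripted roughly as follows. This is only a sketch: it uses the flags shown in the `--help` output, must be run from the root of the repo, and assumes the build script may exit with an error when a mirror lacks a dump (hence `check=False` on the first pass). The function names `build_command` and `two_pass_build` are hypothetical.

```python
import subprocess
import sys
from typing import Optional

MIRROR = "https://mirror.accum.se/mirror/wikimedia.org/dumps/"


def build_command(
    date: str, languages: list[str], mirror: Optional[str] = None
) -> list[str]:
    """Assemble a `python -m prep.build` invocation from the documented flags."""
    cmd = [sys.executable, "-m", "prep.build", "--date", date]
    if languages:
        # An empty list omits --language, which processes all languages.
        cmd += ["--language", *languages]
    if mirror:
        cmd += ["--mirror", mirror]
    return cmd


def two_pass_build(date: str, languages: list[str]) -> None:
    """Run a mirror pass first, then an official-site pass for what is missing."""
    # Pass 1: fast mirror. Failures are tolerated (assumption: the script may
    # exit non-zero when the mirror does not carry a dump).
    subprocess.run(build_command(date, languages, MIRROR), check=False)
    # Pass 2: official site. Directories built in pass 1 are skipped.
    subprocess.run(build_command(date, languages), check=True)
```

For example, `two_pass_build("20230601", ["bs", "be-x-old"])` would try the mirror for both languages and then fall back to the official site for anything still missing.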