Dataset Card for MIRACL Corpus

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world.

This dataset contains the collection data of the 16 "known languages". The remaining 2 "surprise languages" will not be released until later.

The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., \n\n in the wiki markup). Each of these passages serves as a "document", i.e., the unit of retrieval. We preserve the Wikipedia article title of each passage.

Dataset Structure

Each retrieval unit contains three fields: docid, title, and text. Consider an example from the English corpus:

{
    "docid": "39#0",
    "title": "Albedo", 
    "text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}

The docid has the schema X#Y, where all passages with the same X come from the same Wikipedia article and Y denotes the passage within that article, numbered sequentially. The text field contains the text of the passage. The title field contains the name of the article the passage comes from.
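The X#Y schema can be taken apart with a few lines of Python. The following is a minimal sketch, not part of the MIRACL tooling: the helper name split_docid is hypothetical, and it assumes that Y is always a non-negative integer, as in the "39#0" example above. X is kept as a string.

```python
from typing import Tuple


def split_docid(docid: str) -> Tuple[str, int]:
    """Split a docid of the form X#Y into (article_id, passage_index).

    Hypothetical helper for illustration; assumes Y is an integer.
    """
    # rpartition splits on the last '#', so only the final component
    # is treated as the passage index.
    article_id, sep, passage_index = docid.rpartition("#")
    if not sep:
        raise ValueError(f"docid {docid!r} does not follow the X#Y schema")
    return article_id, int(passage_index)


article_id, passage_index = split_docid("39#0")
print(article_id, passage_index)
```

Keeping X as a string avoids assuming anything about the article identifier beyond what the schema states.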

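The collection can be loaded with the Hugging Face datasets library:

import datasets

lang = 'ar'  # or any of the 16 languages
# Each language is a separate configuration; the passages are in the 'train' split.
miracl_corpus = datasets.load_dataset('miracl/miracl-corpus', lang)['train']
for doc in miracl_corpus:
    docid = doc['docid']  # passage identifier in the form X#Y
    title = doc['title']  # title of the source Wikipedia article
    text = doc['text']    # plain text of the passage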
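Because the docid encodes the source article, passages can be regrouped into articles after loading. The sketch below is one possible way to do this; group_by_article is a hypothetical helper, and the sample records are placeholders rather than real corpus entries. It works on any iterable of records with a docid field, and it assumes the passage index after '#' is an integer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_article(docs: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group passage records by the article part (X) of their docid."""
    articles = defaultdict(list)
    for doc in docs:
        article_id, _, _ = doc["docid"].rpartition("#")
        articles[article_id].append(doc)
    # Order each article's passages by their passage index (Y).
    for passages in articles.values():
        passages.sort(key=lambda d: int(d["docid"].rpartition("#")[2]))
    return dict(articles)


# Placeholder records for illustration only (not actual corpus content).
sample = [
    {"docid": "39#1", "title": "Albedo", "text": "(placeholder: second passage)"},
    {"docid": "39#0", "title": "Albedo", "text": "(placeholder: first passage)"},
    {"docid": "40#0", "title": "(placeholder title)", "text": "(placeholder)"},
]
articles = group_by_article(sample)
for article_id, passages in articles.items():
    print(article_id, [p["docid"] for p in passages])
```

For the larger languages this keeps every passage in memory, so it is probably only practical on a subset of the corpus.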

Dataset Statistics and Links

The following table lists the number of passages and Wikipedia articles in the collection for each language, along with links to the datasets and the raw Wikipedia dumps.

Language	# of Passages	# of Articles	Links	Raw Wiki Dump
Arabic (ar)	2,061,414	656,982	🤗	🌏
Bengali (bn)	297,265	63,762	🤗	🌏
English (en)	32,893,221	5,758,285	🤗	🌏
Spanish (es)	10,373,953	1,669,181	🤗	🌏
Persian (fa)	2,207,172	857,827	🤗	🌏
Finnish (fi)	1,883,509	447,815	🤗	🌏
French (fr)	14,636,953	2,325,608	🤗	🌏
Hindi (hi)	506,264	148,107	🤗	🌏
Indonesian (id)	1,446,315	446,330	🤗	🌏
Japanese (ja)	6,953,614	1,133,444	🤗	🌏
Korean (ko)	1,486,752	437,373	🤗	🌏
Russian (ru)	9,543,918	1,476,045	🤗	🌏
Swahili (sw)	131,924	47,793	🤗	🌏
Telugu (te)	518,079	66,353	🤗	🌏
Thai (th)	542,166	128,179	🤗	🌏
Chinese (zh)	4,934,368	1,246,389	🤗	🌏
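The counts above also give a rough sense of how finely articles are segmented in each language. As a small illustration, the snippet below computes the average number of passages per article for three rows copied from the table; the dictionary layout and the function name are choices made for this sketch only.

```python
# (passages, articles) copied from the table above for three languages.
stats = {
    "en": (32_893_221, 5_758_285),
    "sw": (131_924, 47_793),
    "te": (518_079, 66_353),
}


def passages_per_article(passages: int, articles: int) -> float:
    """Average number of passages per Wikipedia article."""
    return passages / articles


for lang, (n_passages, n_articles) in stats.items():
    ratio = passages_per_article(n_passages, n_articles)
    print(f"{lang}: {ratio:.2f} passages per article")
```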

Author:

miracl

Dataset size:

13.97 GB