Datasets: miracl/miracl-corpus
Tasks: Text Retrieval
Sub-tasks: document-retrieval
Multilinguality: multilingual
Annotations Creators: expert-generated
ArXiv: arxiv:2210.09984
License: apache-2.0

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world.
This dataset contains the collection data of the 16 "known languages". The remaining 2 "surprise languages" will not be released until later.
The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., \n\n in the wiki markup). Each of these passages comprises a "document" or unit of retrieval. We preserve the Wikipedia article title of each passage.
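As a rough illustration of the segmentation idea described above, the sketch below splits plain article text on blank lines and attaches the article title to every passage. It is not the WikiExtractor pipeline used to build the corpus; the function name `segment_article` and the sample text are hypothetical, and the real preprocessing may apply further rules.

```python
import re


def segment_article(title: str, article_text: str) -> list[dict]:
    """Split article text on blank lines and keep the title with each passage.

    Illustrative only: the actual corpus was segmented with WikiExtractor.
    """
    chunks = re.split(r"\n\s*\n", article_text)
    return [
        {"title": title, "text": chunk.strip()}
        for chunk in chunks
        if chunk.strip()  # drop empty chunks left by repeated blank lines
    ]


# Hypothetical input, not taken from the corpus.
sample_text = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
passages = segment_article("Example", sample_text)
```

Each element of `passages` is a dictionary holding the shared title and one paragraph of text.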
Each retrieval unit contains three fields: `docid`, `title`, and `text`. Consider an example from the English corpus:

    {
      "docid": "39#0",
      "title": "Albedo",
      "text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
    }

The `docid` has the schema `X#Y`, where all passages with the same `X` come from the same Wikipedia article and `Y` denotes the passage within that article, numbered sequentially. The `text` field contains the text of the passage. The `title` field contains the name of the article the passage comes from.
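The `X#Y` schema can be taken apart with a small helper. The sketch below is one possible way to do it; `parse_docid` is a hypothetical name, and it assumes that `Y` is always a non-negative integer, as the example `39#0` suggests. `X` is kept as a string so that nothing is assumed about its format.

```python
def parse_docid(docid: str) -> tuple[str, int]:
    """Split a docid of the form X#Y into (article id, passage index)."""
    # rsplit on the last '#' so only the final component is treated as Y.
    article_id, passage_index = docid.rsplit("#", 1)
    return article_id, int(passage_index)  # assumes Y is an integer


article_id, passage_index = parse_docid("39#0")
```

For the example record above, `article_id` identifies the "Albedo" article and `passage_index` is the position of the passage within it.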
The collection can be loaded using:

    import datasets

    lang = 'ar'  # or any of the 16 languages
    miracl_corpus = datasets.load_dataset('miracl/miracl-corpus', lang)['train']

    # Each record is a dict with the three fields described above.
    for doc in miracl_corpus:
        docid = doc['docid']
        title = doc['title']
        text = doc['text']

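Because every passage of an article shares the same `X` in its `docid`, records from the loop above can be regrouped by article. The following is a minimal sketch that works on any iterable of records with a `docid` field; `group_by_article` is a hypothetical helper, and the sample records (apart from the `39#0` / "Albedo" pairing) are made up for illustration. It holds everything in memory, so it is only suitable for small subsets of the larger collections.

```python
from collections import defaultdict


def group_by_article(docs):
    """Group passage records by the article part (X) of their X#Y docid."""
    grouped = defaultdict(list)
    for doc in docs:
        article_id = doc["docid"].rsplit("#", 1)[0]
        grouped[article_id].append(doc)
    return dict(grouped)


# Placeholder records standing in for items yielded by the corpus iterator.
sample_docs = [
    {"docid": "39#0", "title": "Albedo", "text": "placeholder passage 0"},
    {"docid": "39#1", "title": "Albedo", "text": "placeholder passage 1"},
    {"docid": "100#0", "title": "Hypothetical article", "text": "placeholder"},
]
by_article = group_by_article(sample_docs)
```

Passing a slice of the loaded corpus instead of `sample_docs` should give one list of passages per article id.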
The following table lists the number of passages and Wikipedia articles in the collection for each language, along with links to the datasets and raw Wikipedia dumps.
Language | # of Passages | # of Articles | Links | Raw Wiki Dump |
---|---|---|---|---|
Arabic (ar) | 2,061,414 | 656,982 | ? | ? |
Bengali (bn) | 297,265 | 63,762 | ? | ? |
English (en) | 32,893,221 | 5,758,285 | ? | ? |
Spanish (es) | 10,373,953 | 1,669,181 | ? | ? |
Persian (fa) | 2,207,172 | 857,827 | ? | ? |
Finnish (fi) | 1,883,509 | 447,815 | ? | ? |
French (fr) | 14,636,953 | 2,325,608 | ? | ? |
Hindi (hi) | 506,264 | 148,107 | ? | ? |
Indonesian (id) | 1,446,315 | 446,330 | ? | ? |
Japanese (ja) | 6,953,614 | 1,133,444 | ? | ? |
Korean (ko) | 1,486,752 | 437,373 | ? | ? |
Russian (ru) | 9,543,918 | 1,476,045 | ? | ? |
Swahili (sw) | 131,924 | 47,793 | ? | ? |
Telugu (te) | 518,079 | 66,353 | ? | ? |
Thai (th) | 542,166 | 128,179 | ? | ? |
Chinese (zh) | 4,934,368 | 1,246,389 | ? | ? |