Dataset:

unicamp-dl/mrobust

Dataset Summary

mRobust is a multilingual version of the TREC 2004 Robust passage ranking dataset. For more information, check out our paper: mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark.

The current version is composed of 10 languages: Chinese, French, German, Indonesian, Italian, Portuguese, Russian, Spanish, Dutch and Vietnamese.

Supported languages

Language name    Language code
English          english
Chinese          chinese
French           french
German           german
Indonesian       indonesian
Italian          italian
Portuguese       portuguese
Russian          russian
Spanish          spanish
Dutch            dutch
Vietnamese       vietnamese

Dataset Structure

You can load the mRobust dataset by choosing a specific language. We include the translated collections of documents and queries.

Queries

>>> from datasets import load_dataset
>>> dataset = load_dataset('unicamp-dl/mrobust', 'queries-spanish')
>>> dataset['queries'][1]
{'id': '302', 'text': '¿Está controlada la enfermedad de la poliomielitis (polio) en el mundo?'}

Collection

>>> dataset = load_dataset('unicamp-dl/mrobust', 'collection-portuguese')
>>> dataset['collection'][5]
{'id': 'FT931-16660', 'text': '930105 FT 05 JAN 93 / Cenelec: Correção O endereço do Cenelec, Comitê Europeu de Normalização Eletrotécnica, estava incorreto na edição de ontem. É Rue de Stassart 35, B-1050, Bruxelas, Tel (322) 519 6871. CEN, Comitê Europeu de Normalização, está localizado na Rue de Stassart 36, B-1050, Bruxelas, Tel 519 6811.'}
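
The two calls above can be combined into a small helper that loads the queries and the collection for one language. This is a minimal sketch, not part of the official dataset tooling: the names `LANGUAGE_CODES`, `config_name` and `load_mrobust` are hypothetical, and it assumes that every language follows the same `queries-<language code>` / `collection-<language code>` naming seen in the Spanish and Portuguese examples.

```python
# Language codes copied from the "Supported languages" table.
LANGUAGE_CODES = (
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "dutch", "vietnamese",
)


def config_name(kind, language):
    """Build a configuration name such as 'queries-spanish'.

    Assumption: all languages follow the '<kind>-<language code>' pattern
    shown in the two examples above.
    """
    if kind not in ("queries", "collection"):
        raise ValueError(f"unknown kind: {kind!r}")
    if language not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {language!r}")
    return f"{kind}-{language}"


def load_mrobust(language):
    """Return (queries_by_id, collection) for one language.

    queries_by_id maps a query id to its text; collection is returned
    as loaded, because it is much larger than the query set.
    """
    # Imported here so the helpers above work without the library installed.
    from datasets import load_dataset

    queries = load_dataset(
        "unicamp-dl/mrobust", config_name("queries", language)
    )["queries"]
    collection = load_dataset(
        "unicamp-dl/mrobust", config_name("collection", language)
    )["collection"]
    queries_by_id = {row["id"]: row["text"] for row in queries}
    return queries_by_id, collection
```

A call such as `queries_by_id, collection = load_mrobust("spanish")` would then download both configurations on first use. Each record is expected to carry the `id` and `text` fields shown in the examples.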

Citation Information

@misc{https://doi.org/10.48550/arxiv.2209.13738,
  doi = {10.48550/ARXIV.2209.13738},
  url = {https://arxiv.org/abs/2209.13738},
  author = {Jeronymo, Vitor and Nascimento, Mauricio and Lotufo, Roberto and Nogueira, Rodrigo},
  title = {mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}