Dataset:
unicamp-dl/mrobust
mRobust is a multilingual version of the TREC 2004 Robust passage ranking dataset. For more information, check out our paper:

* [mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark](https://arxiv.org/abs/2209.13738)

The current version covers 10 languages in addition to the original English: Chinese, French, German, Indonesian, Italian, Portuguese, Russian, Spanish, Dutch and Vietnamese.
Language name | Language code |
---|---|
English | english |
Chinese | chinese |
French | french |
German | german |
Indonesian | indonesian |
Italian | italian |
Portuguese | portuguese |
Russian | russian |
Spanish | spanish |
Dutch | dutch |
Vietnamese | vietnamese |
You can load the mRobust dataset by choosing a specific language. We include the translated collections of documents and queries.

Queries

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('unicamp-dl/mrobust', 'queries-spanish')
    >>> dataset['queries'][1]
    {'id': '302', 'text': '¿Está controlada la enfermedad de la poliomielitis (polio) en el mundo?'}

Collection

    >>> dataset = load_dataset('unicamp-dl/mrobust', 'collection-portuguese')
    >>> dataset['collection'][5]
    {'id': 'FT931-16660', 'text': '930105 FT 05 JAN 93 / Cenelec: Correção O endereço do Cenelec, Comitê Europeu de Normalização Eletrotécnica, estava incorreto na edição de ontem. É Rue de Stassart 35, B-1050, Bruxelas, Tel (322) 519 6871. CEN, Comitê Europeu de Normalização, está localizado na Rue de Stassart 36, B-1050, Bruxelas, Tel 519 6811.'}
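
The two examples above suggest that configuration names follow the pattern `queries-<language code>` and `collection-<language code>`, and that the records sit under a key matching the kind. The sketch below builds on that pattern. It is an assumption that the pattern holds for every code in the language table, since only Spanish queries and the Portuguese collection are shown here. The helper names `config_name`, `load_split` and `index_by_id` are hypothetical and not part of the dataset or of the `datasets` library.

```python
# Language codes copied from the table on this card.
LANGUAGE_CODES = (
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "dutch", "vietnamese",
)


def config_name(kind, language):
    """Build a configuration name such as 'queries-spanish'.

    Assumes the '<kind>-<language code>' pattern seen in the examples
    applies to every language in the table.
    """
    if kind not in ("queries", "collection"):
        raise ValueError(f"unknown kind: {kind!r}")
    if language not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {language!r}")
    return f"{kind}-{language}"


def load_split(kind, language):
    """Load the queries or the collection for one language (needs network access)."""
    # Imported here so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    dataset = load_dataset("unicamp-dl/mrobust", config_name(kind, language))
    # In the examples, the records are stored under a key equal to the kind.
    return dataset[kind]


def index_by_id(records):
    """Map each record's 'id' to its 'text' for direct lookup."""
    return {record["id"]: record["text"] for record in records}
```

For example, `index_by_id(load_split("queries", "spanish"))` would give a dictionary from query id to Spanish query text. Indexing a full collection this way keeps every document in memory, so it is better suited to the queries than to the collections.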
    @misc{https://doi.org/10.48550/arxiv.2209.13738,
      doi = {10.48550/ARXIV.2209.13738},
      url = {https://arxiv.org/abs/2209.13738},
      author = {Jeronymo, Vitor and Nascimento, Mauricio and Lotufo, Roberto and Nogueira, Rodrigo},
      title = {mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark},
      publisher = {arXiv},
      year = {2022},
      copyright = {Creative Commons Attribution 4.0 International}
    }