数据集:
xtreme
The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations. The corpus is made to evaluate how to perform inference in any language (including low-resources ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI is an evaluation benchmark. The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa.
An example of 'validation' looks as follows.
MLQA.ar.deAn example of 'validation' looks as follows.
MLQA.ar.enAn example of 'validation' looks as follows.
MLQA.ar.esAn example of 'validation' looks as follows.
MLQA.ar.hiAn example of 'validation' looks as follows.
The data fields are the same among all splits.
MLQA.ar.arname | validation | test |
---|---|---|
MLQA.ar.ar | 517 | 5335 |
MLQA.ar.de | 207 | 1649 |
MLQA.ar.en | 517 | 5335 |
MLQA.ar.es | 161 | 1978 |
MLQA.ar.hi | 186 | 1831 |
@InProceedings{conneau2018xnli, author = {Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin}, title = {XNLI: Evaluating Cross-lingual Sentence Representations}, booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing}, year = {2018}, publisher = {Association for Computational Linguistics}, location = {Brussels, Belgium}, } @article{hu2020xtreme, author = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham Neubig and Orhan Firat and Melvin Johnson}, title = {XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization}, journal = {CoRR}, volume = {abs/2003.11080}, year = {2020}, archivePrefix = {arXiv}, eprint = {2003.11080} }
Thanks to @thomwolf , @jplu , @lewtun , @lvwerra , @lhoestq , @patrickvonplaten , @mariamabarham for adding this dataset.