Dataset Card for "xnli"

Dataset Summary

XNLI is a subset of a few thousand examples from MNLI that has been translated into 14 different languages (some of them relatively low-resource). As with MNLI, the goal is to predict textual entailment: given two sentences, decide whether sentence A implies sentence B, contradicts it, or neither. It is therefore a classification task with three possible labels.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

all_languages
  • Size of downloaded dataset files: 483.96 MB
  • Size of the generated dataset: 1.61 GB
  • Total amount of disk used: 2.09 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "hypothesis": "{\"language\": [\"ar\", \"bg\", \"de\", \"el\", \"en\", \"es\", \"fr\", \"hi\", \"ru\", \"sw\", \"th\", \"tr\", \"ur\", \"vi\", \"zh\"], \"translation\": [\"احد اع...",
    "label": 0,
    "premise": "{\"ar\": \"واحدة من رقابنا ستقوم بتنفيذ تعليماتك كلها بكل دقة\", \"bg\": \"един от нашите номера ще ви даде инструкции .\", \"de\": \"Eine ..."
}

ar
  • Size of downloaded dataset files: 483.96 MB
  • Size of the generated dataset: 109.32 MB
  • Total amount of disk used: 593.29 MB

An example of 'validation' looks as follows.

{
    "hypothesis": "اتصل بأمه حالما أوصلته حافلة المدرسية.",
    "label": 1,
    "premise": "وقال، ماما، لقد عدت للمنزل."
}

bg
  • Size of downloaded dataset files: 483.96 MB
  • Size of the generated dataset: 128.32 MB
  • Total amount of disk used: 612.28 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "hypothesis": "\"губиш нещата на следното ниво , ако хората си припомнят .\"...",
    "label": 0,
    "premise": "\"по време на сезона и предполагам , че на твоето ниво ще ги загубиш на следващото ниво , ако те решат да си припомнят отбора на ..."
}

de
  • Size of downloaded dataset files: 483.96 MB
  • Size of the generated dataset: 86.17 MB
  • Total amount of disk used: 570.14 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "hypothesis": "Man verliert die Dinge auf die folgende Ebene , wenn sich die Leute erinnern .",
    "label": 0,
    "premise": "\"Du weißt , während der Saison und ich schätze , auf deiner Ebene verlierst du sie auf die nächste Ebene , wenn sie sich entschl..."
}

el
  • Size of downloaded dataset files: 483.96 MB
  • Size of the generated dataset: 142.30 MB
  • Total amount of disk used: 626.26 MB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "hypothesis": "\"Τηλεφώνησε στη μαμά του μόλις το σχολικό λεωφορείο τον άφησε.\"...",
    "label": 1,
    "premise": "Και είπε, Μαμά, έφτασα στο σπίτι."
}

Data Fields

The data fields are the same among all splits.

all_languages
  • premise: a multilingual string variable, with possible languages including ar, bg, de, el, en.
  • hypothesis: a multilingual string variable, with possible languages including ar, bg, de, el, en.
  • label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).

ar
  • premise: a string feature.
  • hypothesis: a string feature.
  • label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).

bg
  • premise: a string feature.
  • hypothesis: a string feature.
  • label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).

de
  • premise: a string feature.
  • hypothesis: a string feature.
  • label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).

el
  • premise: a string feature.
  • hypothesis: a string feature.
  • label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).
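
As a rough illustration of how the fields fit together, the sketch below flattens one all_languages record into one row per language and maps the integer label to its name. It assumes the record has already been decoded into Python objects, with premise as a dict from language code to text and hypothesis as a dict holding parallel language and translation lists (the shape suggested by the cropped example in Data Instances). The helper name flatten_all_languages and the toy record are hypothetical and are not part of the dataset or of any library.

```python
# Label ids as listed in the data fields above.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def flatten_all_languages(record):
    """Turn one all_languages record into one row per language.

    Assumption: `premise` maps language code -> text, and `hypothesis`
    holds parallel `language` and `translation` lists.
    """
    # Pair each hypothesis translation with its language code.
    hypotheses = dict(zip(record["hypothesis"]["language"],
                          record["hypothesis"]["translation"]))
    rows = []
    for lang, premise_text in record["premise"].items():
        if lang not in hypotheses:
            continue  # skip languages without a matching hypothesis
        rows.append({
            "language": lang,
            "premise": premise_text,
            "hypothesis": hypotheses[lang],
            "label": LABEL_NAMES[record["label"]],
        })
    return rows


# Hypothetical toy record with placeholder text (not taken from the dataset).
toy = {
    "premise": {"en": "premise text", "de": "Praemissentext"},
    "hypothesis": {"language": ["de", "en"],
                   "translation": ["Hypothesentext", "hypothesis text"]},
    "label": 0,
}
rows = flatten_all_languages(toy)
for row in rows:
    print(row)
```

Matching hypotheses by language code rather than by position avoids relying on the two fields listing languages in the same order.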

Data Splits

name           train   validation  test
all_languages  392702  2490        5010
ar             392702  2490        5010
bg             392702  2490        5010
de             392702  2490        5010
el             392702  2490        5010
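
Since every configuration listed above has the same split sizes, the relative share of each split can be worked out once. The following is a small arithmetic sketch using only the numbers from the table; the variable names are illustrative.

```python
# Split sizes copied from the table above (identical for every listed config).
SPLITS = {"train": 392702, "validation": 2490, "test": 5010}

total = sum(SPLITS.values())
shares = {name: count / total for name, count in SPLITS.items()}

print(f"total examples per config: {total}")
for name, count in SPLITS.items():
    # Show each split's size and its fraction of the whole configuration.
    print(f"{name:<10} {count:>7} {shares[name]:.2%}")
```

The train split dominates by a wide margin, so the validation and test splits together make up only a small fraction of each configuration.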

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@InProceedings{conneau2018xnli,
  author = {Conneau, Alexis
                 and Rinott, Ruty
                 and Lample, Guillaume
                 and Williams, Adina
                 and Bowman, Samuel R.
                 and Schwenk, Holger
                 and Stoyanov, Veselin},
  title = {XNLI: Evaluating Cross-lingual Sentence Representations},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods
               in Natural Language Processing},
  year = {2018},
  publisher = {Association for Computational Linguistics},
  location = {Brussels, Belgium},
}

Contributions

Thanks to @lewtun, @mariamabarham, @thomwolf, @lhoestq, @patrickvonplaten for adding this dataset.