Dataset:

multi_nli

Tasks:

Text Classification

Sub-tasks:

natural-language-inference multi-input-text-classification

Languages:

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

crowdsourced found

Annotation creators:

crowdsourced

Source datasets:

original

License:


Dataset Card for Multi-Genre Natural Language Inference (MultiNLI)

Dataset Summary

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset contains samples in English only.

Dataset Structure

Data Instances

Size of downloaded dataset files: 226.85 MB
Size of the generated dataset: 76.95 MB
Total amount of disk used: 303.81 MB

Example of a data instance:

{
    "promptID": 31193,
    "pairID": "31193n",
    "premise": "Conceptually cream skimming has two basic dimensions - product and geography.",
    "premise_binary_parse": "( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) )",
    "premise_parse": "(ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .)))",
    "hypothesis": "Product and geography are what make cream skimming work. ",
    "hypothesis_binary_parse": "( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) )",
    "hypothesis_parse": "(ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .)))",
    "genre": "government",
    "label": 1
}
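
The binary parse strings in the instance above can be reduced to a plain token list. The following is a minimal sketch, not part of the original card: the helper name tokens_from_binary_parse is hypothetical, and it assumes that brackets in the parse are whitespace-separated and that literal parentheses in the sentence text are escaped rather than written as "(" or ")".

```python
from typing import List


def tokens_from_binary_parse(parse: str) -> List[str]:
    """Return the leaf tokens of an unlabeled binary-branching parse string.

    Assumption: every bracket is a separate whitespace-delimited item, so
    dropping "(" and ")" after a split leaves the tokens in sentence order.
    """
    return [item for item in parse.split() if item not in ("(", ")")]


# The premise_binary_parse value from the example instance above.
parse = (
    "( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) "
    "( ( product and ) geography ) ) ) . ) )"
)
tokens = tokens_from_binary_parse(parse)
print(tokens)
```

The bracket structure is discarded here; a model that needs the tree shape would have to parse the nesting instead of filtering it out.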

Data Fields

The data fields are the same among all splits.

promptID : Unique identifier for the prompt
pairID : Unique identifier for the pair
premise, hypothesis : The premise and hypothesis sentences, as strings
premise_parse, hypothesis_parse : Each sentence as parsed by the Stanford PCFG Parser 3.5.2
premise_binary_parse, hypothesis_binary_parse : Parses in unlabeled binary-branching format
genre : The genre of the source text, a string feature (for example, "government")
label : A classification label, with possible values entailment (0), neutral (1), and contradiction (2). Instances without a gold label are marked with the label -1. Filter them out before training, for example with datasets.Dataset.filter.
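
The label encoding and the -1 filtering step described above can be sketched as follows. This is an illustrative example rather than code from the dataset authors: the names LABEL_NAMES and has_gold_label and the toy rows are hypothetical, and plain Python lists stand in for a real dataset so the snippet runs without any download.

```python
# Integer-to-name mapping as listed in the field description above.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def has_gold_label(example: dict) -> bool:
    """Return True when the example has a gold label (anything other than -1)."""
    return example["label"] != -1


# Hypothetical toy rows standing in for real dataset examples.
rows = [
    {"pairID": "a", "label": 1},
    {"pairID": "b", "label": -1},  # no gold label, should be dropped
    {"pairID": "c", "label": 0},
]

labeled = [row for row in rows if has_gold_label(row)]
names = [LABEL_NAMES[row["label"]] for row in labeled]
print(names)  # ['neutral', 'entailment']
```

With the Hugging Face datasets library, the same predicate can be passed directly to the filter method, for example dataset.filter(has_gold_label).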

Data Splits

Split                   Examples
train                   392702
validation_matched      9815
validation_mismatched   9832
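
Because there are two validation splits, results are usually reported for each one separately instead of pooled. The sketch below shows one way to do that; the accuracy helper and the toy gold labels and predictions are hypothetical and do not come from the dataset.

```python
from typing import Dict, List


def accuracy(gold: List[int], predicted: List[int]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty split")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical toy labels and predictions for the two validation splits.
gold_by_split: Dict[str, List[int]] = {
    "validation_matched": [0, 1, 2, 1],
    "validation_mismatched": [2, 0, 1, 1],
}
pred_by_split: Dict[str, List[int]] = {
    "validation_matched": [0, 1, 2, 0],
    "validation_mismatched": [2, 1, 1, 0],
}

scores = {
    split: accuracy(gold_by_split[split], pred_by_split[split])
    for split in gold_by_split
}
print(scores)  # {'validation_matched': 0.75, 'validation_mismatched': 0.5}
```

Keeping the two scores apart makes any gap between them visible, which is the point of the cross-genre evaluation mentioned in the summary.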

Dataset Creation

Curation Rationale

The authors constructed MultiNLI to make it possible to explicitly evaluate models both on the quality of their sentence representations within the training domain and on their ability to derive reasonable representations in unfamiliar domains.

Source Data

Initial Data Collection and Normalization

The authors created each sentence pair by selecting a premise sentence from a preexisting text source and asking a human annotator to compose a novel sentence to pair with it as a hypothesis.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The majority of the corpus is released under the OANC’s license, which allows all content to be freely used, modified, and shared under permissive terms. The data in the FICTION section falls under several permissive licenses; Seven Swords is available under a Creative Commons Share-Alike 3.0 Unported License, and with the explicit permission of the author, Living History and Password Incorrect are available under Creative Commons Attribution 3.0 Unported Licenses; the remaining works of fiction are in the public domain in the United States (but may be licensed differently elsewhere).

Citation Information

@InProceedings{N18-1101,
  author = "Williams, Adina
            and Nangia, Nikita
            and Bowman, Samuel",
  title = "A Broad-Coverage Challenge Corpus for
           Sentence Understanding through Inference",
  booktitle = "Proceedings of the 2018 Conference of
               the North American Chapter of the
               Association for Computational Linguistics:
               Human Language Technologies, Volume 1 (Long
               Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "1112--1122",
  location = "New Orleans, Louisiana",
  url = "http://aclweb.org/anthology/N18-1101"
}

Contributions

Thanks to @bhavitvyamalik, @patrickvonplaten, @thomwolf, @mariamabarham for adding this dataset.

Author:

Anonymous

Dataset size:

17.45 KB