Dataset:
multi_nli_mismatch
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.
An example from the 'train' split looks as follows.
{ "hypothesis": "independence", "label": "contradiction", "premise": "correlation" }
The data fields are the same among all splits.
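
The record layout above can be checked with a small helper. This is a minimal sketch, not part of the dataset's tooling: the names `validate_record`, `NLI_LABELS`, and `REQUIRED_FIELDS` are hypothetical, and it assumes labels are stored as strings (as in the sample record) drawn from the three standard NLI classes.

```python
from typing import Dict

# Assumption: labels are strings from the three standard NLI classes;
# the sample record above uses "contradiction".
NLI_LABELS = ("entailment", "neutral", "contradiction")
REQUIRED_FIELDS = ("premise", "hypothesis", "label")


def validate_record(record: Dict[str, str]) -> bool:
    """Return True if the record has all three string fields and a known label."""
    for field in REQUIRED_FIELDS:
        # A missing field yields None, which is not a string.
        if not isinstance(record.get(field), str):
            return False
    return record["label"] in NLI_LABELS


# The sample record shown in this card.
example = {"hypothesis": "independence", "label": "contradiction", "premise": "correlation"}
print(validate_record(example))
```

Because the fields are the same in every split, the same check would apply to both 'train' and 'validation' records.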
name | train | validation |
---|---|---|
plain_text | 392702 | 10000 |
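
To put the split sizes in proportion, the short sketch below computes each split's share of the configuration's total. The counts are copied from the table; the variable names are illustrative only.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 392702, "validation": 10000}

total = sum(SPLIT_SIZES.values())

# Fraction of all examples that falls in each split.
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, frac in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} examples ({frac:.2%} of {total})")
```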
@InProceedings{N18-1101,
  author = "Williams, Adina and Nangia, Nikita and Bowman, Samuel",
  title = "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference",
  booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "1112--1122",
  location = "New Orleans, Louisiana",
  url = "http://aclweb.org/anthology/N18-1101"
}
Thanks to @thomwolf, @patrickvonplaten, and @mariamabarham for adding this dataset.