Dataset:
indonli
License:
cc-by-sa-4.0
Source datasets:
original
Language creators:
expert-generated
Size:
10K<n<100K
Multilinguality:
monolingual
Language:
id
Task:
text-classification
IndoNLI is the first human-elicited Natural Language Inference (NLI) dataset for Indonesian. IndoNLI is annotated by both crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning.
Indonesian
An example from the train split looks as follows.
{ "premise": "Keindahan alam yang terdapat di Gunung Batu Jonggol ini dapat Anda manfaatkan sebagai objek fotografi yang cantik.", "hypothesis": "Keindahan alam tidak dapat difoto.", "label": 2 }
The data fields are premise (string), hypothesis (string), and label (integer class label).
The data is split across train, valid, test_lay, and test_expert.
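The integer label in the example can be turned into a readable class name with a small lookup. The sketch below assumes the label order 0 = entailment, 1 = neutral, 2 = contradiction; this order is not stated in this card (it is an assumption, although the example is consistent with it, since "Keindahan alam tidak dapat difoto" contradicts the premise), and the helper name decode_label is hypothetical.

```python
# ASSUMPTION: label ids follow the order entailment, neutral, contradiction.
# The card itself does not state this mapping; verify it against the dataset
# before relying on it.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# The train example quoted above, copied verbatim.
example = {
    "premise": (
        "Keindahan alam yang terdapat di Gunung Batu Jonggol ini dapat "
        "Anda manfaatkan sebagai objek fotografi yang cantik."
    ),
    "hypothesis": "Keindahan alam tidak dapat difoto.",
    "label": 2,
}


def decode_label(record: dict) -> dict:
    """Return a copy of the record with a human-readable label_name added."""
    return {**record, "label_name": LABEL_NAMES[record["label"]]}


decoded = decode_label(example)
print(decoded["label_name"])
```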
test_expert is written by expert annotators, whereas the rest are written by lay annotators.
split | # examples |
---|---|
train | 10330 |
valid | 2197 |
test_lay | 2201 |
test_expert | 2984 |
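As a quick sanity check on the table, the split sizes can be summed and expressed as shares of the whole dataset. This is only arithmetic over the numbers listed above; the variable names are illustrative.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "train": 10330,
    "valid": 2197,
    "test_lay": 2201,
    "test_expert": 2984,
}

total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

# Print each split with its size and its share of all examples.
for name, size in SPLIT_SIZES.items():
    print(f"{name:<12}{size:>6}{shares[name]:>8.1%}")
print(f"{'total':<12}{total:>6}")
```

The total of 17712 examples falls within the 10K<n<100K size category given in the metadata.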
A small subset of test_expert is used as a diagnostic tool. For more information, please visit https://github.com/ir-nlp-csui/indonli
Indonesian NLP is considered under-resourced. Until now, there has been no publicly available human-annotated NLI dataset for Indonesian.
The premises were collected from Indonesian Wikipedia and from other public Indonesian datasets: the Indonesian PUD and GSD treebanks provided by Universal Dependencies 2.5, and IndoSum.
The hypotheses were written by annotators.
Who are the source language producers?
The data was produced by humans.
We start by writing the hypothesis, given the premise and the target label. Then, we ask 2 independent annotators to predict the label, given the premise and hypothesis. If all 3 labels (the hypothesis author's target label plus those of the 2 independent annotators) agree, the annotation process ends for that sample. Otherwise, we incrementally ask additional annotators until 3 annotators agree on the label. If there is no majority consensus after 5 annotations, the sample is removed.
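The agreement procedure described above can be sketched as a small function. This is an illustrative reading of the text, not the authors' actual tooling: the function name resolve_label is hypothetical, and it assumes that the hypothesis author's target label counts as the first of the 5 annotations.

```python
from collections import Counter
from typing import Iterable, Optional

AGREEMENT_NEEDED = 3  # a label is accepted once 3 annotations agree on it
MAX_ANNOTATIONS = 5   # ASSUMPTION: this cap includes the author's own label


def resolve_label(author_label: str,
                  annotator_labels: Iterable[str]) -> Optional[str]:
    """Return the agreed label, or None if the sample should be removed.

    author_label is the target label the hypothesis was written for;
    annotator_labels are the independent predictions, in the order in
    which the annotators were asked.
    """
    votes = Counter([author_label])
    total = 1
    for label in annotator_labels:
        if total >= MAX_ANNOTATIONS:
            break  # cap reached; later annotations are ignored
        votes[label] += 1
        total += 1
        top_label, top_count = votes.most_common(1)[0]
        if top_count >= AGREEMENT_NEEDED:
            return top_label  # 3 annotations agree: stop asking
    return None  # no label reached 3 votes within 5 annotations


# All three initial labels agree, so the process ends immediately.
print(resolve_label("entailment", ["entailment", "entailment"]))
# No label reaches 3 votes within 5 annotations, so the sample is dropped.
print(resolve_label("entailment",
                    ["neutral", "contradiction", "neutral", "contradiction"]))
```

Under this reading, the agreed label may differ from the author's target label when later annotators outvote it; whether such samples were kept with the majority label is not stated in the text above.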
Who are the annotators?
Lay annotators were computer science students, and expert annotators were NLP scientists with 7+ years of research experience in NLP. All annotators are native speakers. Additionally, expert annotators were explicitly instructed to provide challenging examples by incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Annotators were compensated at an hourly rate.
There might be some personal information coming from Wikipedia and news, especially the information of famous/important people.
IndoNLI was created using premise sentences taken from Wikipedia and news. These data sources may contain some bias.
No other known limitations
This dataset is the result of the collaborative work of Indonesian researchers from the University of Indonesia, kata.ai, New York University, Fondazione Bruno Kessler, and the University of St Andrews.
CC-BY-SA 4.0.
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Please contact the authors for any information on the dataset.
@inproceedings{mahendra-etal-2021-indonli,
    title = "{I}ndo{NLI}: A Natural Language Inference Dataset for {I}ndonesian",
    author = "Mahendra, Rahmad and Aji, Alham Fikri and Louvan, Samuel and Rahman, Fahrurrozi and Vania, Clara",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.821",
    pages = "10511--10527",
}
Thanks to @afaji for adding this dataset.