Dataset: Divyanshu/indicxnli
Tasks: Text Classification
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: machine-generated
Annotation creators: machine-generated
Source datasets: original
ArXiv: arxiv:2204.08776
License: cc0-1.0

INDICXNLI is similar in form to the existing XNLI dataset but focuses on the Indic language family. It includes NLI data for eleven major Indic languages: Assamese (‘as’), Gujarati (‘gu’), Kannada (‘kn’), Malayalam (‘ml’), Marathi (‘mr’), Odia (‘or’), Punjabi (‘pa’), Tamil (‘ta’), Telugu (‘te’), Hindi (‘hi’), and Bengali (‘bn’).
Tasks: Natural Language Inference
Leaderboards: Currently there is no Leaderboard for this dataset.
One example from the hi dataset is given below in JSON format; label 1 corresponds to neutral.
{"premise": "अवधारणात्मक रूप से क्रीम स्किमिंग के दो बुनियादी आयाम हैं-उत्पाद और भूगोल।", "hypothesis": "उत्पाद और भूगोल क्रीम स्किमिंग का काम करते हैं।", "label": 1}
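To read labels as names rather than integers, a minimal sketch of the decoding step follows. It assumes the standard XNLI ordering (0 = entailment, 1 = neutral, 2 = contradiction); only 1 = neutral is confirmed by the example above. `LABEL_NAMES` and `describe` are illustrative names, not part of the dataset.

```python
# Assumed XNLI label order; only index 1 (neutral) is confirmed by the card.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def describe(example: dict) -> str:
    """Return a one-line summary with the integer label decoded to its name."""
    label = LABEL_NAMES[example["label"]]
    return f"{example['premise']} => {example['hypothesis']} [{label}]"


# The Hindi example shown above, with the label kept as a plain integer.
example = {
    "premise": "अवधारणात्मक रूप से क्रीम स्किमिंग के दो बुनियादी आयाम हैं-उत्पाद और भूगोल।",
    "hypothesis": "उत्पाद और भूगोल क्रीम स्किमिंग का काम करते हैं।",
    "label": 1,
}
print(describe(example))
```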
Language | ISO 639-1 Code | Train | Test | Dev |
---|---|---|---|---|
Assamese | as | 392,702 | 5,010 | 2,490 |
Bengali | bn | 392,702 | 5,010 | 2,490 |
Gujarati | gu | 392,702 | 5,010 | 2,490 |
Hindi | hi | 392,702 | 5,010 | 2,490 |
Kannada | kn | 392,702 | 5,010 | 2,490 |
Malayalam | ml | 392,702 | 5,010 | 2,490 |
Marathi | mr | 392,702 | 5,010 | 2,490 |
Odia | or | 392,702 | 5,010 | 2,490 |
Punjabi | pa | 392,702 | 5,010 | 2,490 |
Tamil | ta | 392,702 | 5,010 | 2,490 |
Telugu | te | 392,702 | 5,010 | 2,490 |
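As a quick sanity check on the table, the per-language and overall example counts can be derived from the split sizes. This is only arithmetic on the figures above; the variable names are illustrative.

```python
# Per-language split sizes taken from the table above.
SPLIT_SIZES = {"train": 392_702, "test": 5_010, "dev": 2_490}
NUM_LANGUAGES = 11

# Every language has identical split sizes, so one sum covers each of them.
per_language = sum(SPLIT_SIZES.values())
total = per_language * NUM_LANGUAGES

print(f"examples per language: {per_language:,}")
print(f"examples overall:      {total:,}")
```

The overall count falls inside the 1M<n<10M size category listed in the dataset metadata.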
Code snippet for loading the dataset with the Hugging Face datasets library:
from datasets import load_dataset
dataset = load_dataset("Divyanshu/indicxnli")
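To work with a single language, one option is to pass a language code as the configuration name. The sketch below assumes the repository exposes one configuration per ISO code from the table (for example "hi"); the card does not state this, so treat it as an assumption. `LANGUAGE_CODES` and `load_language` are hypothetical helpers.

```python
# Language codes from the statistics table. Using them as configuration
# names is an assumption, not something the card states.
LANGUAGE_CODES = ("as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te")


def load_language(code: str):
    """Load one language of IndicXNLI, validating the code first."""
    if code not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {code!r}")
    # Imported lazily so the validation above works without the library.
    from datasets import load_dataset

    # The second positional argument of load_dataset is the configuration name.
    return load_dataset("Divyanshu/indicxnli", code)
```

Calling `load_language("hi")` downloads the data on first use. Validating the code before the call keeps a typo from turning into a slow, network-bound failure.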
The dataset was created by machine-translating the English XNLI dataset into the 11 Indic languages listed above.
Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan
Contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Copyright of the dataset contents belongs to the original copyright holders.
If you use any of the datasets, models or code modules, please cite the following paper:
@misc{https://doi.org/10.48550/arxiv.2204.08776,
  doi = {10.48550/ARXIV.2204.08776},
  url = {https://arxiv.org/abs/2204.08776},
  author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}