Dataset:

Divyanshu/indicxnli

Multilinguality:

multilingual

Size:

1M<n<10M

Language creators:

machine-generated

Annotation creators:

machine-generated

Source datasets:

original

Preprint:

arxiv:2204.08776

License:

cc0-1.0

Dataset Card for "IndicXNLI"

Dataset Summary

INDICXNLI is similar to the existing XNLI dataset in shape and form, but focuses on the Indic language family. It includes NLI data for eleven major Indic languages: Assamese (‘as’), Gujarati (‘gu’), Kannada (‘kn’), Malayalam (‘ml’), Marathi (‘mr’), Odia (‘or’), Punjabi (‘pa’), Tamil (‘ta’), Telugu (‘te’), Hindi (‘hi’), and Bengali (‘bn’).

Supported Tasks and Leaderboards

Tasks: Natural Language Inference

Leaderboards: There is currently no leaderboard for this dataset.

Languages

  • Assamese (as)
  • Bengali (bn)
  • Gujarati (gu)
  • Kannada (kn)
  • Hindi (hi)
  • Malayalam (ml)
  • Marathi (mr)
  • Oriya (or)
  • Punjabi (pa)
  • Tamil (ta)
  • Telugu (te)

Dataset Structure

Data Instances

One example from the hi (Hindi) dataset is given below in JSON format. Label 1 denotes neutral.

{"premise": "अवधारणात्मक रूप से क्रीम स्किमिंग के दो बुनियादी आयाम हैं-उत्पाद और भूगोल।",
 "hypothesis": "उत्पाद और भूगोल क्रीम स्किमिंग का काम करते हैं।",
 "label": 1}

Data Fields

  • premise (string) : Premise Sentence
  • hypothesis (string) : Hypothesis Sentence
  • label (integer) : Integer label: 0 if the premise entails the hypothesis, 2 if the hypothesis contradicts the premise, and 1 otherwise (neutral).
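The label scheme above can be turned into a small lookup when inspecting examples. This is a minimal sketch: the name "neutral" comes from the data instance shown earlier, while "entailment" and "contradiction" are the conventional NLI class names and are assumed here. The helper `label_name` is hypothetical and not part of the dataset or any library.

```python
# Integer labels as described in the Data Fields list.
# "entailment" and "contradiction" are assumed conventional names;
# "neutral" matches the example instance in this card.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_name(label: int) -> str:
    """Return the class name for an integer label (hypothetical helper)."""
    try:
        return LABEL_NAMES[label]
    except KeyError:
        raise ValueError(f"unexpected label: {label!r}") from None


# The Hindi example from the Data Instances section.
example = {
    "premise": "अवधारणात्मक रूप से क्रीम स्किमिंग के दो बुनियादी आयाम हैं-उत्पाद और भूगोल।",
    "hypothesis": "उत्पाद और भूगोल क्रीम स्किमिंग का काम करते हैं।",
    "label": 1,
}

print(label_name(example["label"]))  # neutral
```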

Data Splits

Language     ISO 639-1 Code   Train     Test    Dev
Assamese     as               392,702   5,010   2,490
Bengali      bn               392,702   5,010   2,490
Gujarati     gu               392,702   5,010   2,490
Hindi        hi               392,702   5,010   2,490
Kannada      kn               392,702   5,010   2,490
Malayalam    ml               392,702   5,010   2,490
Marathi      mr               392,702   5,010   2,490
Oriya        or               392,702   5,010   2,490
Punjabi      pa               392,702   5,010   2,490
Tamil        ta               392,702   5,010   2,490
Telugu       te               392,702   5,010   2,490
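As a quick sanity check on the split table, the per-language and overall example counts can be computed directly from the figures listed. This is only arithmetic on the numbers in this card; the variable names are illustrative.

```python
# ISO 639-1 codes and split sizes copied from the Data Splits table.
LANGUAGES = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]
SPLIT_SIZES = {"train": 392_702, "test": 5_010, "dev": 2_490}

# Every language has the same split sizes, so the total is a simple product.
per_language = sum(SPLIT_SIZES.values())
total = per_language * len(LANGUAGES)

print(f"examples per language: {per_language:,}")  # 400,202
print(f"examples overall:      {total:,}")         # 4,402,222
```

The overall count falls within the 1M<n<10M size category stated for the dataset.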

Dataset Usage

The dataset can be loaded with the Hugging Face datasets library:

from datasets import load_dataset

dataset = load_dataset("Divyanshu/indicxnli")
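Because the data is organised per language, it may be more convenient to load a single language at a time. The sketch below assumes that each language is exposed as a configuration named by its ISO 639-1 code (for example "hi"), which this card does not state explicitly. The wrapper `load_indicxnli` is a hypothetical helper, not part of the datasets library.

```python
from typing import Optional

# ISO 639-1 codes from the Languages section of this card.
LANGUAGE_CODES = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]


def load_indicxnli(language: str, split: Optional[str] = None):
    """Load one language of IndicXNLI (hypothetical helper).

    Assumes the configuration name equals the language code.
    """
    if language not in LANGUAGE_CODES:
        raise ValueError(
            f"unknown language code {language!r}; expected one of {LANGUAGE_CODES}"
        )
    # Imported lazily so the argument check works without the library installed.
    from datasets import load_dataset

    return load_dataset("Divyanshu/indicxnli", language, split=split)
```

Under that assumption, `load_indicxnli("hi", split="train")` would return the Hindi training split, and omitting `split` would return all available splits for that language.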

Dataset Creation

The dataset was created by machine translation of the English XNLI dataset into the 11 Indic languages listed above.

Curation Rationale

[More information needed]

Source Data

XNLI dataset

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

Human Verification Process

Detailed in the paper

Considerations for Using the Data

Social Impact of Dataset

Detailed in the paper

Discussion of Biases

Detailed in the paper

Other Known Limitations

Detailed in the paper

Dataset Curators

Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

Licensing Information

The contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use any of the datasets, models or code modules, please cite the following paper:

@misc{https://doi.org/10.48550/arxiv.2204.08776,
  doi = {10.48550/ARXIV.2204.08776},
  url = {https://arxiv.org/abs/2204.08776},
  author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}