Dataset:

Divyanshu/indicxnli

Task:

Text classification

Subtask:

natural-language-inference

Languages:

Multilinguality:

multilingual

Size:

1M<n<10M

Language creators:

machine-generated

Annotation creators:

machine-generated

Source datasets:

original

Preprint:

arxiv:2204.08776

License:

cc0-1.0


Dataset Card for "IndicXNLI"

Dataset Summary

INDICXNLI is similar to the existing XNLI dataset in shape and form, but focuses on the Indic language family. It includes NLI data for eleven major Indic languages: Assamese (‘as’), Gujarati (‘gu’), Kannada (‘kn’), Malayalam (‘ml’), Marathi (‘mr’), Odia (‘or’), Punjabi (‘pa’), Tamil (‘ta’), Telugu (‘te’), Hindi (‘hi’), and Bengali (‘bn’).

Supported Tasks and Leaderboards

Tasks: Natural Language Inference

Leaderboards: There is currently no leaderboard for this dataset.

Languages

Assamese (as)
Bengali (bn)
Gujarati (gu)
Kannada (kn)
Hindi (hi)
Malayalam (ml)
Marathi (mr)
Oriya (or)
Punjabi (pa)
Tamil (ta)
Telugu (te)

Dataset Structure

Data Instances

One example from the hi (Hindi) dataset is given below in JSON format. Its label, 1, means neutral.

 {"premise": "अवधारणात्मक रूप से क्रीम स्किमिंग के दो बुनियादी आयाम हैं-उत्पाद और भूगोल।",
 "hypothesis": "उत्पाद और भूगोल क्रीम स्किमिंग का काम करते हैं।",
 "label": 1}

Data Fields

premise (string) : Premise sentence
hypothesis (string) : Hypothesis sentence
label (integer) : Integer label: 0 if the premise entails the hypothesis (entailment), 2 if the hypothesis contradicts the premise (contradiction), and 1 otherwise (neutral).
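
The integer labels can be turned into readable names with a small lookup. The following is a minimal sketch, assuming the 0/1/2 convention described in the label field; `ID2LABEL` and `decode_label` are hypothetical helper names introduced here for illustration and are not part of the dataset or of the datasets library.

```python
# Label ids as described under Data Fields (assumed XNLI-style mapping).
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode_label(example: dict) -> dict:
    """Return a copy of one record with a human-readable label name added."""
    label_id = example["label"]
    if label_id not in ID2LABEL:
        raise ValueError(f"unexpected label id: {label_id!r}")
    return {**example, "label_name": ID2LABEL[label_id]}


# Placeholder record with the same three fields as a dataset instance.
record = {"premise": "premise text", "hypothesis": "hypothesis text", "label": 1}
print(decode_label(record)["label_name"])
```

Returning a copy rather than mutating the input keeps the helper safe to use on records that are shared elsewhere.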

Data Splits

Language	ISO 639-1 Code	Train	Test	Dev
Assamese	as	392,702	5,010	2,490
Bengali	bn	392,702	5,010	2,490
Gujarati	gu	392,702	5,010	2,490
Hindi	hi	392,702	5,010	2,490
Kannada	kn	392,702	5,010	2,490
Malayalam	ml	392,702	5,010	2,490
Marathi	mr	392,702	5,010	2,490
Oriya	or	392,702	5,010	2,490
Punjabi	pa	392,702	5,010	2,490
Tamil	ta	392,702	5,010	2,490
Telugu	te	392,702	5,010	2,490
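
Because every language has identical split sizes, the overall size follows from simple arithmetic. A short sketch using only the numbers in the table above (the variable names are illustrative):

```python
# Per-language split sizes copied from the Data Splits table.
SPLIT_SIZES = {"train": 392_702, "test": 5_010, "dev": 2_490}
LANGUAGES = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]

# Sum the three splits for one language, then scale by the number of languages.
per_language = sum(SPLIT_SIZES.values())
total = per_language * len(LANGUAGES)

print(f"{per_language:,} examples per language")
print(f"{total:,} examples overall")
```

This gives 400,202 examples per language and 4,402,222 overall, which is consistent with the 1M<n<10M size category.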

Dataset usage

The dataset can be loaded with the datasets library:

from datasets import load_dataset

dataset = load_dataset("Divyanshu/indicxnli")
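
To work with a single language, a wrapper along the following lines may help. This is a sketch under an assumption that is not confirmed by this card: that each language is exposed as a configuration named by its language code (for example "hi"). The helper `load_indicxnli` and the constant `LANGUAGE_CODES` are hypothetical names introduced here; only `load_dataset` and its `name` parameter come from the datasets library.

```python
# Language codes listed in the Languages section of this card.
LANGUAGE_CODES = ("as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te")


def load_indicxnli(lang: str):
    """Load one language of IndicXNLI.

    Assumption: the configuration name equals the language code.
    """
    if lang not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code {lang!r}; expected one of {LANGUAGE_CODES}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    return load_dataset("Divyanshu/indicxnli", name=lang)


# Example call (downloads data, so it is left commented out):
# hindi = load_indicxnli("hi")
```

Validating the code before calling the library gives a clearer error message for a mistyped language than a failed download would.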

Dataset Creation

The dataset was created by machine-translating the English XNLI dataset into the 11 Indic languages listed above.

Curation Rationale

[More information needed]

Source Data

XNLI dataset

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

Human Verification Process

Detailed in the paper

Considerations for Using the Data

Dataset Curators

Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

Licensing Information

The contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use any of the datasets, models or code modules, please cite the following paper:

@misc{https://doi.org/10.48550/arxiv.2204.08776,
  doi = {10.48550/ARXIV.2204.08776},
  url = {https://arxiv.org/abs/2204.08776},
  author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Author:

Divyanshu

Dataset size:

2.34 GB
