Dataset:
indic_glue
IndicGLUE is a natural language understanding benchmark for Indian languages. It contains a wide variety of tasks and covers 11 major Indian languages - as, bn, gu, hi, kn, ml, mr, or, pa, ta, te.
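Each configuration on this card is named as a task followed by a language code, for example `copa.hi` or `actsa-sc.te`. The sketch below splits such a name into its two parts and shows one way a configuration might be loaded. It assumes the Hugging Face `datasets` library is installed and that the identifier `indic_glue` still resolves on the Hub; depending on the library version, a different (for example namespaced) identifier may be required. The helper names `split_config` and `load_indic_glue` are hypothetical and not part of any library.

```python
def split_config(name):
    """Split a configuration name such as 'copa.hi' into (task, language)."""
    task, _, lang = name.rpartition(".")
    if not task or not lang:
        raise ValueError(f"expected '<task>.<language>', got {name!r}")
    return task, lang


def load_indic_glue(config_name):
    """Load one configuration (assumption: 'indic_glue' resolves on the Hub)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    split_config(config_name)  # validate the name before any download
    return load_dataset("indic_glue", config_name)


if __name__ == "__main__":
    print(split_config("actsa-sc.te"))
    print(split_config("copa.hi"))
```

Calling `load_indic_glue("copa.hi")` needs network access, so it is left out of the `__main__` block.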
The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. The examples are manually constructed to foil simple statistical methods: each one is contingent on contextual information provided by a single word or phrase in the sentence. To convert the problem into sentence pair classification, we construct sentence pairs by replacing the ambiguous pronoun with each possible referent. The task is to predict whether the sentence with the pronoun substituted is entailed by the original sentence. We use a small evaluation set consisting of new examples derived from fiction books that was shared privately by the authors of the original corpus. While the included training set is balanced between two classes, the test set is imbalanced between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: hypotheses are sometimes shared between training and development examples, so if a model memorizes the training examples, it will predict the wrong label on the corresponding development set example. As with QNLI, each example is evaluated separately, so there is not a systematic correspondence between a model's score on this task and its score on the unconverted original task. We call the converted dataset WNLI (Winograd NLI). This dataset has been translated into 3 Indian languages and publicly released by AI4Bharat.
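The conversion to sentence pairs described above can be sketched as follows. This is a simplified illustration, not the original conversion script: it substitutes the first whole-word occurrence of the pronoun, and it assumes the label convention 1 = entailment, 0 = not entailment. The function name `make_wnli_pairs`, the output field names, and the example sentence are chosen for illustration only.

```python
import re


def make_wnli_pairs(sentence, pronoun, candidates, correct):
    """Build one (premise, hypothesis, label) record per candidate referent."""
    # Match the pronoun as a whole word so "it" does not match inside "fit".
    pattern = re.compile(r"\b" + re.escape(pronoun) + r"\b")
    if not pattern.search(sentence):
        raise ValueError(f"pronoun {pronoun!r} not found in sentence")

    pairs = []
    for candidate in candidates:
        # A function replacement avoids backslash handling in the candidate.
        hypothesis = pattern.sub(lambda _m, c=candidate: c, sentence, count=1)
        pairs.append(
            {
                "premise": sentence,
                "hypothesis": hypothesis,
                # Assumed convention: 1 = entailment, 0 = not entailment.
                "label": int(candidate == correct),
            }
        )
    return pairs


if __name__ == "__main__":
    for pair in make_wnli_pairs(
        "The trophy doesn't fit in the suitcase because it is too big.",
        "it",
        ["the trophy", "the suitcase"],
        correct="the trophy",
    ):
        print(pair["label"], pair["hypothesis"])
```

Each candidate yields its own example, which is why every pair is scored separately and a model's score here does not map directly onto the original task.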
actsa-sc.te
An example of 'validation' looks as follows.
This example was too long and was cropped: { "label": 0, "text": "\"ప్రయాణాల్లో ఉన్నవారికోసం బస్ స్టేషన్లు, రైల్వే స్టేషన్లలో పల్స్పోలియో బూతులను ఏర్పాటు చేసి చిన్నారులకు పోలియో చుక్కలు వేసేలా ఏర..." }
bbca.hi
An example of 'train' looks as follows.
This example was too long and was cropped: { "label": "pakistan", "text": "\"नेटिजन यानि इंटरनेट पर सक्रिय नागरिक अब ट्विटर पर सरकार द्वारा लगाए प्रतिबंधों के समर्थन या विरोध में अपने विचार व्यक्त करते है..." }
copa.en
An example of 'validation' looks as follows.
{ "choice1": "I swept the floor in the unoccupied room.", "choice2": "I shut off the light in the unoccupied room.", "label": 1, "premise": "I wanted to conserve energy.", "question": "effect" }
copa.gu
An example of 'train' looks as follows.
This example was too long and was cropped: { "choice1": "\"સ્ત્રી જાણતી હતી કે તેનો મિત્ર મુશ્કેલ સમયમાંથી પસાર થઈ રહ્યો છે.\"...", "choice2": "\"મહિલાને લાગ્યું કે તેના મિત્રએ તેની દયાળુ લાભ લીધો છે.\"...", "label": 0, "premise": "મહિલાએ તેના મિત્રની મુશ્કેલ વર્તન સહન કરી.", "question": "cause" }
copa.hi
An example of 'validation' looks as follows.
{ "choice1": "मैंने उसका प्रस्ताव ठुकरा दिया।", "choice2": "उन्होंने मुझे उत्पाद खरीदने के लिए राजी किया।", "label": 0, "premise": "मैंने सेल्समैन की पिच पर शक किया।", "question": "effect" }
The data fields are the same among all splits.
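As a reading aid for the COPA instances above, here is a minimal sketch of how the fields might fit together. It assumes that `label` 0 selects `choice1` and `label` 1 selects `choice2`; this mapping is inferred from the examples shown on this card rather than taken from a specification. The helper names `copa_answer` and `copa_question_text` are hypothetical.

```python
def copa_answer(example):
    """Return the text of the choice selected by the label."""
    label = example["label"]
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {label!r}")
    # Assumed mapping: 0 -> choice1, 1 -> choice2.
    return example["choice1"] if label == 0 else example["choice2"]


def copa_question_text(example):
    """Phrase the instance as a question about the premise."""
    if example["question"] == "cause":
        return f'{example["premise"]} What was the cause?'
    if example["question"] == "effect":
        return f'{example["premise"]} What happened as a result?'
    raise ValueError(f'unexpected question type: {example["question"]!r}')


if __name__ == "__main__":
    # The copa.en validation instance shown on this card.
    example = {
        "choice1": "I swept the floor in the unoccupied room.",
        "choice2": "I shut off the light in the unoccupied room.",
        "label": 1,
        "premise": "I wanted to conserve energy.",
        "question": "effect",
    }
    print(copa_question_text(example))
    print(copa_answer(example))
```

For the `copa.en` instance this selects "I shut off the light in the unoccupied room.", the plausible effect of wanting to conserve energy.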
name | train | validation | test
---|---|---|---
actsa-sc.te | 4328 | 541 | 541
bbca.hi | 3467 | - | 866
copa.en | 400 | 100 | 500
copa.gu | 362 | 88 | 448
copa.hi | 362 | 88 | 449
@inproceedings{kakwani-etal-2020-indicnlpsuite,
    title = "{I}ndic{NLPS}uite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for {I}ndian Languages",
    author = "Kakwani, Divyanshu and Kunchukuttan, Anoop and Golla, Satish and N.C., Gokul and Bhattacharyya, Avik and Khapra, Mitesh M. and Kumar, Pratyush",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.445",
    doi = "10.18653/v1/2020.findings-emnlp.445",
    pages = "4948--4961",
}

@inproceedings{Levesque2011TheWS,
    title={The Winograd Schema Challenge},
    author={H. Levesque and E. Davis and L. Morgenstern},
    booktitle={KR},
    year={2011}
}
Thanks to @sumanthd17 for adding this dataset.