Dataset:

NYTK/HuWNLI

Task:

task_categories:other

Subtask:

coreference-resolution

Language:

hu

Multilinguality:

monolingual

Size:

size_categories:unknown

Language creators:

found expert-generated

Annotation creators:

found

Source datasets:

extended|other

Other:

structure-prediction

License:

cc-by-sa-4.0

Dataset Card for HuWNLI

Dataset Summary

This is the dataset card for the Hungarian translation of the Winograd schemata, formatted as an inference task. A Winograd schema is a pair of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences; resolving it requires world knowledge and reasoning (Levesque et al. 2012). This dataset is also part of the Hungarian Language Understanding Evaluation Benchmark Kit, HuLU. The corpus was created by translating and manually curating the original English Winograd schemata. The NLI format was created by replacing the ambiguous pronoun with each possible referent (the method is described in the GLUE paper, Wang et al. 2019). We extended the set of sentence pairs derived from the schemata with translations of the sentence pairs that, together with the Winograd schema sentences, make up the WNLI dataset of GLUE.
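
The pronoun-replacement step can be illustrated with a minimal Python sketch. It is not the authors' conversion tooling: the function name `make_nli_pairs`, its parameters, and the English toy schema are assumptions made for illustration, and the real conversion was done manually by a linguistic expert.

```python
import re


def make_nli_pairs(premise, pronoun_clause, pronoun, candidates, correct):
    """Build one (premise, hypothesis, label) triple per candidate referent.

    Hypothetical helper: the hypothesis is the clause containing the
    ambiguous pronoun, with the pronoun replaced by a candidate referent.
    """
    pattern = r"\b" + re.escape(pronoun) + r"\b"
    pairs = []
    for candidate in candidates:
        # Replace only the first whole-word occurrence of the pronoun.
        hypothesis = re.sub(pattern, lambda m: candidate, pronoun_clause, count=1)
        # Capitalize the first letter so the hypothesis reads as a sentence.
        hypothesis = hypothesis[0].upper() + hypothesis[1:]
        # "1" marks the entailed hypothesis, "0" the non-entailed one.
        label = "1" if candidate == correct else "0"
        pairs.append((premise, hypothesis, label))
    return pairs


# Toy English schema, used here only to illustrate the idea.
premise = "The trophy doesn't fit in the suitcase because it is too big."
pairs = make_nli_pairs(
    premise,
    pronoun_clause="it is too big.",
    pronoun="it",
    candidates=["the trophy", "the suitcase"],
    correct="the trophy",
)
for _, hypothesis, label in pairs:
    print(label, hypothesis)
```

Each schema sentence therefore yields one sentence pair per candidate referent, and only the pair built from the correct referent is labeled as entailed.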

Languages

The BCP-47 code for Hungarian, the only represented language in this dataset, is hu-HU.

Dataset Structure

Data Instances

For each instance, there is an orig_id, an id, two sentences and a label.

An example:

{"orig_id": "4",
 "id": "4",
 "sentence1": "A férfi nem tudta felemelni a fiát, mert olyan nehéz volt.",
 "sentence2": "A fia nehéz volt.",
 "label": "1"
}

Data Fields

orig_id: the original id of this sentence pair (more precisely, its English counterpart's) in GLUE's WNLI dataset;
id: unique id of the instances;
sentence1: the premise;
sentence2: the hypothesis;
label: "1" if sentence2 is entailed by sentence1, and "0" otherwise.
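
A short sketch of how a record with these fields might be parsed and checked is given below. It assumes that instances are available as JSON objects with string-valued labels, as in the example above; the helper name `parse_instance` is hypothetical, and the label check applies only to the labeled splits.

```python
import json

# Field names as listed in the Data Fields section.
REQUIRED_FIELDS = ("orig_id", "id", "sentence1", "sentence2", "label")


def parse_instance(line):
    """Parse one JSON-encoded instance and check its fields (hypothetical helper)."""
    record = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Assumption: labels are the strings "0" and "1", as in the card's example.
    if record["label"] not in ("0", "1"):
        raise ValueError(f"unexpected label: {record['label']!r}")
    return record


raw = (
    '{"orig_id": "4", "id": "4", '
    '"sentence1": "A férfi nem tudta felemelni a fiát, mert olyan nehéz volt.", '
    '"sentence2": "A fia nehéz volt.", '
    '"label": "1"}'
)
instance = parse_instance(raw)
is_entailed = instance["label"] == "1"
print(instance["sentence2"], is_entailed)
```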

Data Splits

The data is distributed in three splits: training set (562), development set (59) and test set (134). The splits follow those of GLUE's WNLI but contain fewer instances, because many sentence pairs had to be discarded as untranslatable into Hungarian. The training and development sets have been extended with NLI sentence pairs derived from the Hungarian translation of 6 Winograd schemata that were left out of the original WNLI dataset. The test set's sentence pairs are translated from the test set of GLUE's WNLI. That set was distributed without labels; 3 annotators annotated the Hungarian sentence pairs. The test set of HuWNLI is also distributed without labels. To evaluate your model, please contact us, or check HuLU's website for an automatic evaluation (this feature is under construction at the moment).
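
After downloading the data, the split sizes stated above can serve as a quick sanity check. The sketch below assumes the splits are named `train`, `validation` and `test`; these names and the helper `check_split_sizes` are illustrative and may not match the distributed files.

```python
# Split sizes as stated in this card; the split names are an assumption.
EXPECTED_SPLIT_SIZES = {"train": 562, "validation": 59, "test": 134}


def check_split_sizes(actual_sizes, expected=EXPECTED_SPLIT_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs."""
    mismatches = {}
    for split, expected_size in expected.items():
        actual = actual_sizes.get(split)
        if actual != expected_size:
            mismatches[split] = (expected_size, actual)
    return mismatches


total = sum(EXPECTED_SPLIT_SIZES.values())
# Share of each split in the whole dataset, rounded for display.
shares = {split: round(size / total, 3) for split, size in EXPECTED_SPLIT_SIZES.items()}
print(total, shares)
```

An empty result from `check_split_sizes` means the loaded data matches the sizes documented here.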

Dataset Creation

Source Data

Initial Data Collection and Normalization

The data is a translation of the English Winograd schemata and the additional sentence pairs of GLUE's WNLI. Each schema and sentence pair was translated by a human translator. Each schema was manually curated by a linguistic expert. The schemata were transformed into NLI format by a linguistic expert.

During the adaptation process, we found two erroneous labels in the train set of GLUE's WNLI (id 347 and id 464). We corrected them in our dataset.

Additional Information

Average human performance on the test set is 92.78% (accuracy).
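
Accuracy here is the fraction of sentence pairs whose predicted label matches the gold label. Because the test labels are not distributed, a comparable score can be computed locally only on a labeled split such as the development set. A minimal sketch follows; the toy label lists are made up for illustration and are not taken from the dataset.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels for illustration only (not real HuWNLI data).
gold = ["1", "0", "1", "1"]
pred = ["1", "0", "0", "1"]
score = accuracy(gold, pred)
print(f"{score:.2%}")
```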

Licensing Information

HuWNLI is released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Citation Information

If you use this resource or any part of its documentation, please refer to:

Ligeti-Nagy, N., Héja, E., Laki, L. J., Takács, D., Yang, Z. Gy. and Váradi, T. (2023) Hát te mekkorát nőttél! - A HuLU első életéve új adatbázisokkal és webszolgáltatással [Look at how much you have grown! - The first year of HuLU with new databases and with webservice]. In: Berend, G., Gosztolya, G. and Vincze, V. (eds), XIX. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, Informatikai Intézet. 217-230.

@inproceedings{ligetinagy2023hulu,
  title={Hát te mekkorát nőttél! - A HuLU első életéve új adatbázisokkal és webszolgáltatással},
  author={Ligeti-Nagy, N. and Héja, E. and Laki, L. J. and Takács, D. and Yang, Z. Gy. and Váradi, T.},
  booktitle={XIX. Magyar Számítógépes Nyelvészeti Konferencia},
  year={2023},
  editor = {Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika},
  address = {Szeged},
  publisher = {JATEPress},
  pages = {217–230}
}

Ligeti-Nagy, N., Ferenczi, G., Héja, E., Jelencsik-Mátyus, K., Laki, L. J., Vadász, N., Yang, Z. Gy. and Váradi, T. (2022) HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából [HuLU: Hungarian benchmark dataset to evaluate neural language models]. In: Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika (eds), XVIII. Magyar Számítógépes Nyelvészeti Konferencia. JATEPress, Szeged. 431–446.

@inproceedings{ligetinagy2022hulu,
  title={HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából},
  author={Ligeti-Nagy, N. and Ferenczi, G. and Héja, E. and Jelencsik-Mátyus, K. and Laki, L. J. and Vadász, N. and Yang, Z. Gy. and Váradi, T.},
  booktitle={XVIII. Magyar Számítógépes Nyelvészeti Konferencia},
  year={2022},
  editor = {Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika},
  address = {Szeged},
  publisher = {JATEPress},
  pages = {431–446}
}

and to:

Levesque, Hector, Davis, Ernest and Morgenstern, Leora (2012) The Winograd Schema Challenge. In: Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

@inproceedings{levesque2012winograd,
  title={The Winograd Schema Challenge},
  author={Levesque, Hector and Davis, Ernest and Morgenstern, Leora},
  booktitle={Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning},
  year={2012},
  organization={Citeseer}
}

Contributions

Thanks to lnnoemi for adding this dataset.

Author:

NYTK

Dataset size:

206.66 KB