Dataset:
NYTK/HuWNLI
This is the dataset card for the Hungarian translation of the Winograd schemata, formatted as an inference task. A Winograd schema is a pair of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences; resolving it requires world knowledge and reasoning (Levesque et al. 2012). This dataset is also part of the Hungarian Language Understanding Evaluation Benchmark Kit (HuLU). The corpus was created by translating and manually curating the original English Winograd schemata. The NLI format was created by replacing the ambiguous pronoun with each possible referent (the method is described in the GLUE paper, Wang et al. 2019). We extended the set of sentence pairs derived from the schemata with translations of the sentence pairs that, together with the Winograd schema sentences, make up the WNLI dataset of GLUE.
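The pronoun-replacement idea can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the procedure the authors used: the HuWNLI pairs were produced by human translators and a linguistic expert, and the hypotheses are often shorter clauses rather than the full sentence. The function name `schema_to_nli` and the English toy sentence are hypothetical and not taken from the dataset.

```python
import re
from typing import List, Tuple


def schema_to_nli(
    sentence: str, pronoun: str, candidates: List[str], correct: str
) -> List[Tuple[str, str, str]]:
    """Build (premise, hypothesis, label) triples by substituting each
    candidate referent for the first whole-word occurrence of the pronoun.

    Hypothetical helper for illustration only.
    """
    pattern = re.compile(rf"\b{re.escape(pronoun)}\b")
    if not pattern.search(sentence):
        raise ValueError(f"pronoun {pronoun!r} not found in sentence")
    pairs = []
    for candidate in candidates:
        # A function replacement avoids backslash interpretation in the referent.
        hypothesis = pattern.sub(lambda _match: candidate, sentence, count=1)
        # "1" marks the substitution that matches the intended referent.
        label = "1" if candidate == correct else "0"
        pairs.append((sentence, hypothesis, label))
    return pairs


# Toy English sentence in the style of a Winograd schema (not from HuWNLI).
pairs = schema_to_nli(
    "The trophy does not fit in the suitcase because it is too big.",
    pronoun="it",
    candidates=["the trophy", "the suitcase"],
    correct="the trophy",
)
for premise, hypothesis, label in pairs:
    print(label, hypothesis)
```

Each schema sentence therefore yields one pair per candidate referent, with exactly one pair labelled as entailed.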
The BCP-47 code for Hungarian, the only language represented in this dataset, is hu-HU.
For each instance, there is an orig_id, an id, two sentences and a label.
An example:
{"orig_id": "4", "id": "4", "sentence1": "A férfi nem tudta felemelni a fiát, mert olyan nehéz volt.", "sentence2": "A fia nehéz volt.", "label": "1" }
orig_id: the original id of this sentence pair (more precisely, its English counterpart's) in GLUE's WNLI dataset;
id: unique id of the instances;
sentence1: the premise;
sentence2: the hypothesis;
label: "1" if sentence2 is entailed by sentence1, and "0" otherwise.
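As a minimal sketch of how such an instance might be read in Python, assuming each labelled instance is available as a JSON object with the fields listed above: the `Instance` class and the `parse_instance` helper are hypothetical names, and the keys are lower-cased on the assumption that the label key may appear with either capitalisation.

```python
import json
from dataclasses import dataclass


@dataclass
class Instance:
    orig_id: str
    id: str
    premise: str     # sentence1
    hypothesis: str  # sentence2
    entailed: bool   # True when label == "1"


def parse_instance(line: str) -> Instance:
    """Parse one JSON-encoded, labelled record (hypothetical helper)."""
    raw = json.loads(line)
    # Lower-case the keys so that "Label" and "label" are treated alike.
    rec = {key.lower(): value for key, value in raw.items()}
    if rec["label"] not in ("0", "1"):
        raise ValueError(f"unexpected label: {rec['label']!r}")
    return Instance(
        orig_id=rec["orig_id"],
        id=rec["id"],
        premise=rec["sentence1"],
        hypothesis=rec["sentence2"],
        entailed=rec["label"] == "1",
    )


example = (
    '{"orig_id": "4", "id": "4", '
    '"sentence1": "A férfi nem tudta felemelni a fiát, mert olyan nehéz volt.", '
    '"sentence2": "A fia nehéz volt.", "label": "1"}'
)
inst = parse_instance(example)
print(inst.hypothesis, inst.entailed)
```

The sketch covers labelled instances only; records without a label would need separate handling.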
The data is distributed in three splits: training set (562), development set (59) and test set (134). The splits follow those of GLUE's WNLI but contain fewer instances, as many sentence pairs had to be discarded because they could not be translated into Hungarian. The training and development sets have been extended with NLI sentence pairs derived from the Hungarian translation of 6 Winograd schemata left out of the original WNLI dataset. The test set's sentence pairs are translated from GLUE's WNLI test set, which was distributed without labels; three annotators annotated the Hungarian sentence pairs. The test set of HuWNLI is also distributed without labels. To evaluate your model, please contact us, or check HuLU's website for an automatic evaluation (this feature is under construction at the moment).
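The split sizes quoted above can be summarised with a few lines of Python. This is only arithmetic on the published counts; the dictionary keys are illustrative names and not necessarily the split identifiers used in the distributed files.

```python
# Instance counts as stated in this card; key names are illustrative.
SPLIT_SIZES = {"train": 562, "dev": 59, "test": 134}

total = sum(SPLIT_SIZES.values())
# Only the training and development sets ship with labels.
labelled = SPLIT_SIZES["train"] + SPLIT_SIZES["dev"]

for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} instances ({size / total:.1%} of {total})")
print(f"labelled instances: {labelled}")
```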
The data is a translation of the English Winograd schemata and the additional sentence pairs of GLUE's WNLI. Each schema and sentence pair was translated by a human translator. Each schema was manually curated by a linguistic expert. The schemata were transformed into NLI format by a linguistic expert.
During the adaptation process, we found two erroneous labels in the training set of GLUE's WNLI (id 347 and id 464). We corrected them in our dataset.
Average human performance on the test set is 92.78% (accuracy).
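Accuracy here is the share of sentence pairs whose predicted label matches the gold label. A minimal sketch, assuming labels are the strings "0" and "1" as in the data fields; the `accuracy` helper and the toy label lists are made up for illustration and are not dataset values.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels for illustration only.
gold = ["1", "0", "1", "1"]
predicted = ["1", "0", "0", "1"]
score = accuracy(gold, predicted)
print(f"{score:.2%}")
```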
HuWNLI is released under the Creative Commons Attribution-ShareAlike 4.0 International License.
If you use this resource or any part of its documentation, please refer to:
Ligeti-Nagy, N., Héja, E., Laki, L. J., Takács, D., Yang, Z. Gy. and Váradi, T. (2023) Hát te mekkorát nőttél! - A HuLU első életéve új adatbázisokkal és webszolgáltatással [Look at how much you have grown! - The first year of HuLU with new databases and with webservice]. In: Berend, G., Gosztolya, G. and Vincze, V. (eds), XIX. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, Informatikai Intézet. 217-230.
@inproceedings{ligetinagy2023hulu, title={Hát te mekkorát nőttél! - A HuLU első életéve új adatbázisokkal és webszolgáltatással}, author={Ligeti-Nagy, N. and Héja, E. and Laki, L. J. and Takács, D. and Yang, Z. Gy. and Váradi, T.}, booktitle={XIX. Magyar Számítógépes Nyelvészeti Konferencia}, year={2023}, editors = {Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika}, address = {Szeged}, publisher = {JATEPress}, pages = {217–230} }
Ligeti-Nagy, N., Ferenczi, G., Héja, E., Jelencsik-Mátyus, K., Laki, L. J., Vadász, N., Yang, Z. Gy. and Váradi, T. (2022) HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából [HuLU: Hungarian benchmark dataset to evaluate neural language models]. In: Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika (eds), XVIII. Magyar Számítógépes Nyelvészeti Konferencia. JATEPress, Szeged. 431–446.
@inproceedings{ligetinagy2022hulu, title={HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából}, author={Ligeti-Nagy, N. and Ferenczi, G. and Héja, E. and Jelencsik-Mátyus, K. and Laki, L. J. and Vadász, N. and Yang, Z. Gy. and Váradi, T.}, booktitle={XVIII. Magyar Számítógépes Nyelvészeti Konferencia}, year={2022}, editors = {Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika}, address = {Szeged}, publisher = {JATEPress}, pages = {431–446} }
and to:
Levesque, Hector, Davis, Ernest, Morgenstern, Leora (2012) The Winograd Schema Challenge. In: Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
@inproceedings{levesque2012winograd, title={The Winograd Schema Challenge}, author={Levesque, Hector and Davis, Ernest and Morgenstern, Leora}, booktitle={Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning}, year={2012}, organization={Citeseer} }
Thanks to lnnoemi for adding this dataset.