Dataset: NYTK/HuCOLA
Subtasks: text-simplification
Languages: hu
Multilinguality: monolingual
Language creators: found
Annotations creators: expert-generated
Source datasets: original
License: cc-by-sa-4.0

This is the dataset card for the Hungarian Corpus of Linguistic Acceptability (HuCOLA), which is also part of the Hungarian Language Understanding Evaluation Benchmark Kit (HuLU).
The BCP-47 code for Hungarian, the only represented language in this dataset, is hu-HU.
For each instance, there is an id, a sentence and a label.
An example:
{"Sent_id": "dev_0", "Sent": "A földek eláradtak.", "Label": "0"}
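A minimal sketch of reading one such record in Python, using only the standard library. The `Instance` type and the `parse_instance` helper are illustrative names, not part of any HuCOLA tooling. The sketch assumes the label is stored as a string holding an integer, as in the example record.

```python
import json
from typing import NamedTuple


class Instance(NamedTuple):
    """One HuCOLA record (hypothetical container, for illustration only)."""
    sent_id: str
    sentence: str
    label: int


def parse_instance(line: str) -> Instance:
    """Parse a JSON record with the Sent_id, Sent and Label fields."""
    record = json.loads(line)
    # Assumption: Label is a string holding an integer, as in the example record.
    return Instance(record["Sent_id"], record["Sent"], int(record["Label"]))


example = '{"Sent_id": "dev_0", "Sent": "A földek eláradtak.", "Label": "0"}'
instance = parse_instance(example)
print(instance.sent_id, instance.label)
```

Casting the label to an integer up front keeps later metric code free of string comparisons.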
HuCOLA has 3 splits: train, validation and test.
Dataset split | Number of sentences in the split | Proportion of the split |
---|---|---|
train | 7276 | 80% |
validation | 900 | 10% |
test | 900 | 10% |
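The proportions in the table are rounded. As a quick check, the exact shares can be recomputed from the sentence counts; the snippet below uses only the numbers listed in the table.

```python
# Sentence counts per split, copied from the table above.
split_sizes = {"train": 7276, "validation": 900, "test": 900}

total = sum(split_sizes.values())
proportions = {name: size / total for name, size in split_sizes.items()}

for name, share in proportions.items():
    # Print each share as a percentage with one decimal place.
    print(f"{name}: {share:.1%}")
```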
The test data is distributed without the labels. To evaluate your model, please contact us, or check HuLU's website for an automatic evaluation (this feature is under construction at the moment). The evaluation metric is the Matthews correlation coefficient.
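Since the test labels are not distributed, a model can be scored locally on the validation split. The following is a minimal sketch of the Matthews correlation coefficient for the binary case, written with the standard library only. It assumes labels are the integers 0 and 1, and it returns 0.0 when the coefficient is undefined (a zero denominator), which is a convention chosen here rather than something the dataset card specifies. The function name and the sample label lists are illustrative.

```python
import math
from typing import Sequence


def matthews_corrcoef(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Binary MCC for labels in {0, 1}; returns 0.0 when undefined."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)  # true negatives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumption: treat the undefined case as no correlation.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up labels for illustration; not taken from HuCOLA.
gold_labels = [1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 0, 1, 1]
score = matthews_corrcoef(gold_labels, predicted_labels)
print(f"MCC: {score:.3f}")
```

In practice, `sklearn.metrics.matthews_corrcoef` from scikit-learn computes the same quantity and also handles the multiclass case.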
The data was collected by two human annotators from three main linguistics books on the Hungarian language.
The process of collecting sentences partly followed the one described in Warstadt et al. (2018). The guideline of our process is available in the repository of HuCOLA.
Each instance was annotated by 4 human annotators for its acceptability (see the annotation guidelines in the repository of HuCOLA).
Who are the annotators?
The annotators were native Hungarian speakers (of various ages, from 20 to 67) without any linguistic background.
HuCOLA is released under the CC-BY-SA 4.0 licence.
If you use this resource or any part of its documentation, please refer to:
Ligeti-Nagy, N., Ferenczi, G., Héja, E., Jelencsik-Mátyus, K., Laki, L. J., Vadász, N., Yang, Z. Gy. and Váradi, T. (2022) HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából [HuLU: Hungarian benchmark dataset to evaluate neural language models]. XVIII. Magyar Számítógépes Nyelvészeti Konferencia. (in press)
@inproceedings{ligetinagy2022hulu,
  title={HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából},
  author={Ligeti-Nagy, N. and Ferenczi, G. and Héja, E. and Jelencsik-Mátyus, K. and Laki, L. J. and Vadász, N. and Yang, Z. Gy. and Váradi, T.},
  booktitle={XVIII. Magyar Számítógépes Nyelvészeti Konferencia},
  year={2022}
}
Thanks to lnnoemi for adding this dataset.