数据集:
NYTK/HuCoPA
This is the dataset card for the Hungarian Choice of Plausible Alternatives Corpus (HuCoPA), which is also part of the Hungarian Language Understanding Evaluation Benchmark Kit HuLU . The corpus was created by translating and re-annotating the original English CoPA corpus (Roemmele et al., 2011).
'commonsense reasoning' 'question answering'
The BCP-47 code for Hungarian, the only represented language in this dataset, is hu-HU.
For each instance, there is an id, a premise, a question ('cause' or 'effect'), two alternatives and a label (1 or 2).
An example:
{"idx": "1", "question": "cause", "label": "1", "premise": "A testem árnyékot vetett a fűre.", "choice1": "Felkelt a nap.", "choice2": "A füvet lenyírták."}
HuCoPA has 3 splits: train , validation and test .
Dataset split | Number of instances in the split |
---|---|
train | 400 |
validation | 100 |
test | 500 |
The test data is distributed without the labels. To evaluate your model, please contact us , or check HuLU's website for an automatic evaluation (this feature is under construction at the moment).
The data is a translation of the content of the CoPA corpus. Each sentence was translated by a human translator. Each translation was manually checked and further refined by another annotator.
The instances initially inherited their original labels from the CoPA dataset. Each instance was annotated by a human annotator. If the original label and the human annotator's label did not match, we manually curated the instance and assigned a final label to that. This step was necessary to ensure that the causal realationship had not been changed or lost during the translation process.
Who are the annotators?The translators were native Hungarian speakers with English proficiency. The annotators were university students with some linguistic background.
The human performance on the test set is 96% (accuracy).
HuCoPA is released under the BSD 2-Clause License.
Copyright (c) 2010, University of Southern California All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
If you use this resource or any part of its documentation, please refer to:
Ligeti-Nagy, N., Ferenczi, G., Héja, E., Jelencsik-Mátyus, K., Laki, L. J., Vadász, N., Yang, Z. Gy. and Váradi, T. (2022) HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából [HuLU: Hungarian benchmark dataset to evaluate neural language models]. In: Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika (eds), XVIII. Magyar Számítógépes Nyelvészeti Konferencia. JATEPress, Szeged. 431–446.
@inproceedings{ligetinagy2022hulu, title={HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából}, author={Ligeti-Nagy, N. and Ferenczi, G. and Héja, E. and Jelencsik-Mátyus, K. and Laki, L. J. and Vadász, N. and Yang, Z. Gy. and Váradi, T.}, booktitle={XVIII. Magyar Számítógépes Nyelvészeti Konferencia}, year={2022}, editors = {Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika}, address = {Szeged}, publisher = {JATEPress}, pages = {431–446} }
and to:
Roemmele, M., Bejan, C., and Gordon, A. (2011) Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, March 21-23, 2011.
@inproceedings{roemmele2011choice, title={Choice of plausible alternatives: An evaluation of commonsense causal reasoning}, author={Roemmele, Melissa and Bejan, Cosmin Adrian and Gordon, Andrew S}, booktitle={2011 AAAI Spring Symposium Series}, year={2011}, url={https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF}, }
Thanks to lnnoemi for adding this dataset.