数据集:
NYTK/HuSST
This is the dataset card for the Hungarian version of the Stanford Sentiment Treebank. This dataset which is also part of the Hungarian Language Understanding Evaluation Benchmark Kit HuLU . The corpus was created by translating and re-annotating the original SST (Roemmele et al., 2011).
'sentiment classification'
'sentiment scoring'
The BCP-47 code for Hungarian, the only represented language in this dataset, is hu-HU.
For each instance, there is an id, a sentence and a sentiment label.
An example:
{ "Sent_id": "dev_0", "Sent": "Nos, a Jason elment Manhattanbe és a Pokolba kapcsán, azt hiszem, az elkerülhetetlen folytatások ötletlistájáról kihúzhatunk egy űrállomást 2455-ben (hé, ne lődd le a poént).", "Label": "neutral" }
Sent_id: unique id of the instances;
Sent: the sentence, translation of an instance of the SST dataset;
Label: "negative", "neutral", or "positive".
HuSST has 3 splits: train , validation and test .
Dataset split | Number of instances in the split |
---|---|
train | 9344 |
validation | 1168 |
test | 1168 |
The test data is distributed without the labels. To evaluate your model, please contact us , or check HuLU's website for an automatic evaluation (this feature is under construction at the moment).
The data is a translation of the content of the SST dataset (only the whole sentences were used). Each sentence was translated by a human translator. Each translation was manually checked and further refined by another annotator.
The translated sentences were annotated by three human annotators with one of the following labels: negative, neutral and positive. Each sentence was then curated by a fourth annotator (the 'curator'). The final label is the decision of the curator based on the three labels of the annotators.
Who are the annotators?The translators were native Hungarian speakers with English proficiency. The annotators were university students with some linguistic background.
If you use this resource or any part of its documentation, please refer to:
Ligeti-Nagy, N., Ferenczi, G., Héja, E., Jelencsik-Mátyus, K., Laki, L. J., Vadász, N., Yang, Z. Gy. and Vadász, T. (2022) HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából [HuLU: Hungarian benchmark dataset to evaluate neural language models]. XVIII. Magyar Számítógépes Nyelvészeti Konferencia. pp. 431–446.
@inproceedings{ligetinagy2022hulu, title={HuLU: magyar nyelvű benchmark adatbázis kiépítése a neurális nyelvmodellek kiértékelése céljából}, author={Ligeti-Nagy, N. and Ferenczi, G. and Héja, E. and Jelencsik-Mátyus, K. and Laki, L. J. and Vadász, N. and Yang, Z. Gy. and Vadász, T.}, booktitle={XVIII. Magyar Számítógépes Nyelvészeti Konferencia}, year={2022}, pages = {431--446} }
and to:
Socher et al. (2013), Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1631--1642.
@inproceedings{socher-etal-2013-recursive, title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank", author = "Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. and Ng, Andrew and Potts, Christopher", booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing", month = oct, year = "2013", address = "Seattle, Washington, USA", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D13-1170", pages = "1631--1642", }
Thanks to lnnoemi for adding this dataset.