Dataset:

ro_sent

Tasks:

text-classification

Sub-tasks:

sentiment-classification

Languages:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotation creators:

found

Source datasets:

original

ArXiv:

arxiv:2009.08712

License:

license:unknown

Dataset Card for RoSent

Dataset Summary

This is a Romanian sentiment analysis dataset. It is provided in a processed form, as used by the authors of Romanian Transformers in their examples, and is based on the original data available in this GitHub repository. The original data contains product and movie reviews in Romanian.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The text in this dataset is in Romanian.

Dataset Structure

Data Instances

An instance from the train split:

{'id': '0', 'label': 1, 'original_id': '0', 'sentence': 'acest document mi-a deschis cu adevarat ochii la ceea ce oamenii din afara statelor unite s-au gandit la atacurile din 11 septembrie. acest film a fost construit in mod expert si prezinta acest dezastru ca fiind mai mult decat un atac asupra pamantului american. urmarile acestui dezastru sunt previzionate din multe tari si perspective diferite. cred ca acest film ar trebui sa fie mai bine distribuit pentru acest punct. de asemenea, el ajuta in procesul de vindecare sa vada in cele din urma altceva decat stirile despre atacurile teroriste. si unele dintre piese sunt de fapt amuzante, dar nu abuziv asa. acest film a fost extrem de recomandat pentru mine, si am trecut pe acelasi sentiment.'}

Data Fields

original_id: a string feature containing the original id from the source file.
id: a string feature.
sentence: a string feature.
label: a classification label, with possible values negative (0) and positive (1).
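
The field descriptions above can be illustrated with a minimal sketch that checks one record against them and maps the integer label to its name. Only the field names and the two label ids come from this card; the helper name `validate_instance` and the constants are hypothetical, and the example sentence is a shortened excerpt of the instance shown earlier.

```python
# Field names and label ids are taken from the card; everything else is illustrative.
LABEL_NAMES = {0: "negative", 1: "positive"}
STRING_FIELDS = ("original_id", "id", "sentence")


def validate_instance(instance: dict) -> str:
    """Check one record against the documented fields and return its label name."""
    for field in STRING_FIELDS:
        if not isinstance(instance.get(field), str):
            raise ValueError(f"field {field!r} must be a string")
    label = instance.get("label")
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    return LABEL_NAMES[label]


# Shortened version of the train instance shown under "Data Instances".
example = {
    "id": "0",
    "label": 1,
    "original_id": "0",
    "sentence": "acest film a fost extrem de recomandat pentru mine.",
}
print(validate_instance(example))  # prints "positive"
```

A check like this is mainly useful when the data is re-exported or merged with other sources, where ids may silently become integers or labels may be remapped.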

Data Splits

This dataset has two splits: train with 17941 examples, and test with 11005 examples.
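
As a quick sanity check on these numbers, the short sketch below derives the total size and the share of each split from the two counts stated above. The variable names are illustrative only.

```python
# Split sizes as stated in this card.
SPLIT_SIZES = {"train": 17941, "test": 11005}

total = sum(SPLIT_SIZES.values())
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} examples ({fractions[name]:.1%})")
print(f"total: {total} examples")
# train: 17941 examples (62.0%)
# test: 11005 examples (38.0%)
# total: 28946 examples
```

The total of 28,946 examples is consistent with the 10K<n<100K size category listed in the metadata, and the test split is unusually large, at roughly 38% of the data.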

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

The source dataset is available in this GitHub repository and is based on product and movie reviews. The original source of the reviews is unknown.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Stefan Daniel Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo, @katakonst

Licensing Information

[More Information Needed]

Citation Information

@article{dumitrescu2020birth,
  title={The birth of Romanian BERT},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius and Pyysalo, Sampo},
  journal={arXiv preprint arXiv:2009.08712},
  year={2020}
}

Contributions

Thanks to @gchhablani and @iliemihai for adding this dataset.

Author:

Unknown

Dataset size:

12.76 KB