Dataset: ro_sts
Tasks: text classification
Languages: ro
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: extended|other-sts-b
License: cc-by-4.0

We present RO-STS, the Semantic Textual Similarity dataset for the Romanian language. It is a high-quality translation of the English STS dataset. RO-STS contains 8,628 sentence pairs with their similarity scores. The original English sentences were collected from news headlines, captions of images and user forums, and are categorized accordingly. The Romanian release follows this categorization and provides the same train/validation/test split, with 5,749, 1,500, and 1,379 sentence pairs respectively.
The text in the dataset is in Romanian (ro).
An example looks like this:
{'score': 1.5, 'sentence1': 'Un bărbat cântă la harpă.', 'sentence2': 'Un bărbat cântă la claviatură.'}
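As a rough illustration of how such a record might be handled in code, the sketch below wraps the three fields shown above in a small typed container. The `StsPair` class and `parse_pair` helper are hypothetical names introduced here, and the 0 to 5 score range is an assumption carried over from the English STS benchmark rather than something this card states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StsPair:
    """One RO-STS record: two sentences and their similarity score."""
    score: float
    sentence1: str
    sentence2: str


def parse_pair(record, max_score=5.0):
    """Convert a raw record dict into an StsPair.

    Assumption: scores lie in [0, max_score]; the default of 5.0 follows
    the English STS benchmark and is not confirmed by this card.
    """
    score = float(record["score"])
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} outside [0, {max_score}]")
    return StsPair(score, record["sentence1"], record["sentence2"])


# The example record shown above.
example = {
    "score": 1.5,
    "sentence1": "Un bărbat cântă la harpă.",
    "sentence2": "Un bărbat cântă la claviatură.",
}
pair = parse_pair(example)
```

Keeping the range check in one place makes it easy to adjust if the actual score scale turns out to differ.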
The train/validation/test splits contain 5,749/1,500/1,379 sentence pairs, respectively.
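The split sizes can be sanity-checked against the reported total of 8,628 pairs with a few lines of Python. The sketch below also includes a loader built on the Hugging Face `datasets` library; it assumes the dataset is reachable under the identifier `ro_sts` (taken from the metadata above) and that the package is installed, neither of which is verified here.

```python
# Split sizes and total as reported in this card.
SPLIT_SIZES = {"train": 5749, "validation": 1500, "test": 1379}
TOTAL_PAIRS = 8628


def split_fractions(sizes):
    """Return each split's share of the total number of pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_ro_sts():
    """Load RO-STS through the Hugging Face `datasets` library.

    Assumption: the identifier "ro_sts" resolves on the Hub and the
    `datasets` package is available; this function is not called below.
    """
    from datasets import load_dataset  # lazy import: optional dependency

    return load_dataset("ro_sts")


if __name__ == "__main__":
    # The three splits should add up to the reported total.
    assert sum(SPLIT_SIZES.values()) == TOTAL_PAIRS
    for name, share in split_fractions(SPLIT_SIZES).items():
        print(f"{name}: {SPLIT_SIZES[name]} pairs ({share:.1%})")
```

If the loader works in your environment, `len(ds[name])` for each split of the returned object should match the corresponding entry in `SPLIT_SIZES`.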
Initial Data Collection and Normalization
To construct the dataset, we first obtained automatic translations using Google's translation engine. These were then manually checked, corrected, and cross-validated by human volunteers.
Who are the source language producers?
[Needs More Information]
CC BY-SA 4.0 License
@inproceedings{dumitrescu2021liro,
  title={Liro: Benchmark and leaderboard for romanian language tasks},
  author={Dumitrescu, Stefan Daniel and Rebeja, Petru and Lorincz, Beata and Gaman, Mihaela and Avram, Andrei and Ilie, Mihai and Pruteanu, Andrei and Stan, Adriana and Rosia, Lorena and Iacobescu, Cristina and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year={2021}
}
Thanks to @lorinczb for adding this dataset.