Datasets:

ro_sts

Languages:

ro

Multilinguality:

monolingual

Size Categories:

1K<n<10K

Language Creators:

crowdsourced

Annotations Creators:

crowdsourced

License:

cc-by-4.0

Dataset Card for RO-STS

Dataset Summary

We present RO-STS, the Semantic Textual Similarity dataset for the Romanian language. It is a high-quality translation of the English STS dataset. RO-STS contains 8,628 sentence pairs with their similarity scores. The original English sentences were collected from news headlines, image captions, and user forums, and are categorized accordingly. The Romanian release follows this categorization and provides the same train/validation/test split, with 5,749/1,500/1,379 sentence pairs in the respective subsets.

Supported Tasks and Leaderboards

[Needs More Information]

Languages

The text in the dataset is in Romanian (ro).

Dataset Structure

Data Instances

An example looks like this:

{'score': 1.5,
 'sentence1': 'Un bărbat cântă la harpă.',
 'sentence2': 'Un bărbat cântă la claviatură.',
}

Data Fields

  • score : a float representing the semantic similarity score where 0.0 is the lowest score and 5.0 is the highest
  • sentence1 : a string representing a text
  • sentence2 : another string to compare the previous text with
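
The field descriptions above can be checked mechanically. What follows is a minimal sketch, assuming each record is a plain Python dict with the three documented fields. The helper names `validate_record` and `normalized_score` are hypothetical and not part of any RO-STS tooling, and rescaling the score to [0, 1] is only a common convenience (for example when comparing against cosine similarities), not something the dataset prescribes.

```python
from typing import Any, Mapping

# Score range as documented in the Data Fields section.
MIN_SCORE = 0.0
MAX_SCORE = 5.0


def validate_record(record: Mapping[str, Any]) -> None:
    """Raise ValueError if the record does not match the documented fields."""
    for key in ("sentence1", "sentence2"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"{key} must be a string")
    score = record.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not MIN_SCORE <= score <= MAX_SCORE:
        raise ValueError(f"score must lie in [{MIN_SCORE}, {MAX_SCORE}]")


def normalized_score(record: Mapping[str, Any]) -> float:
    """Rescale the 0-5 similarity score to the [0, 1] interval (hypothetical helper)."""
    validate_record(record)
    return (record["score"] - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)


# The instance shown under Data Instances.
example = {
    "score": 1.5,
    "sentence1": "Un bărbat cântă la harpă.",
    "sentence2": "Un bărbat cântă la claviatură.",
}

validate_record(example)
print(normalized_score(example))
```

A record with a missing sentence or a score outside the 0.0 to 5.0 range raises `ValueError` instead of passing silently.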

Data Splits

The train/validation/test splits contain 5,749/1,500/1,379 sentence pairs, respectively.
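
As a quick sanity check on these numbers, the short sketch below sums the split sizes and reports each split's share of the dataset. The `SPLIT_SIZES` dictionary is an illustrative name; its values are simply the counts stated on this card.

```python
# Split sizes as stated on this card.
SPLIT_SIZES = {"train": 5749, "validation": 1500, "test": 1379}

total = sum(SPLIT_SIZES.values())

# Fraction of all sentence pairs that falls into each split.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} pairs ({shares[name]:.1%})")
print(f"total: {total} pairs")
```

The three counts add up to the 8,628 pairs given in the summary, which corresponds to roughly a 66.6% / 17.4% / 16.0% split.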

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

[Needs More Information]

Initial Data Collection and Normalization

To construct the dataset, we first obtained automatic translations using Google's translation engine. These were then manually checked, corrected, and cross-validated by human volunteers.

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

Who are the annotators?

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

CC BY-SA 4.0 License

Citation Information

@inproceedings{dumitrescu2021liro,
  title={Liro: Benchmark and leaderboard for romanian language tasks},
  author={Dumitrescu, Stefan Daniel and Rebeja, Petru and Lorincz, Beata and Gaman, Mihaela and Avram, Andrei and Ilie, Mihai and Pruteanu, Andrei and Stan, Adriana and Rosia, Lorena and Iacobescu, Cristina and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year={2021}
}

Contributions

Thanks to @lorinczb for adding this dataset.