Dataset:

ro_sts_parallel

Tasks:

Translation

Languages:

Multilinguality:

multilingual

Size:

10K<n<100K

Language creators:

crowdsourced

Annotation creators:

crowdsourced

Source datasets:

extended|other-sts-b

License:

cc-by-4.0


Dataset Card for RO-STS-Parallel

Dataset Summary

We present RO-STS-Parallel, a parallel Romanian-English dataset obtained by translating the English STS dataset into Romanian. It contains 17,256 sentence pairs in Romanian and English.

Supported Tasks and Leaderboards

[Needs More Information]

Languages

The text in the dataset is in Romanian and English (ro, en).

Dataset Structure

Data Instances

An example looks like this:

{
  'translation': {
    'ro': 'Problema e si mai simpla.',
    'en': 'The problem is simpler than that.'
  }
}

Data Fields

translation:
- ro: text in Romanian
- en: text in English
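The field layout above can be illustrated with a minimal Python sketch. It works on an in-memory record copied from the Data Instances example; the helper name `to_pairs` and the `Record` alias are hypothetical and not part of the dataset or of any library.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical alias for one record: {"translation": {"ro": ..., "en": ...}}
Record = Dict[str, Dict[str, str]]


def to_pairs(records: Iterable[Record]) -> List[Tuple[str, str]]:
    """Return (Romanian, English) sentence pairs from records shaped like the example."""
    return [(r["translation"]["ro"], r["translation"]["en"]) for r in records]


# Sample record copied from the Data Instances section.
sample = [
    {
        "translation": {
            "ro": "Problema e si mai simpla.",
            "en": "The problem is simpler than that.",
        }
    }
]

pairs = to_pairs(sample)
for ro, en in pairs:
    print(f"ro: {ro}")
    print(f"en: {en}")
```

With the Hugging Face `datasets` library, records of this shape would presumably come from `load_dataset("ro_sts_parallel")`; that the identifier listed in the metadata is also the Hub identifier is an assumption here.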

Data Splits

The train/validation/test splits contain 11,498/3,000/2,758 sentence pairs, respectively.
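As a quick sanity check, the split sizes can be summed and compared with the total of 17,256 given in the Dataset Summary. The sketch below only does arithmetic on the numbers stated in this card; the variable names are illustrative.

```python
# Split sizes as stated in the Data Splits section.
split_sizes = {"train": 11_498, "validation": 3_000, "test": 2_758}

# The total should match the figure in the Dataset Summary.
total = sum(split_sizes.values())

# Share of each split, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(f"total pairs: {total}")
for name, pct in shares.items():
    print(f"{name}: {split_sizes[name]} pairs ({pct}%)")
```

The three splits add up to the stated total, with roughly two thirds of the pairs in the training split.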

Dataset Creation

Curation Rationale

Source Data

Initial Data Collection and Normalization

To construct the dataset, we first obtained automatic translations using Google's translation engine. These were then manually checked, corrected, and cross-validated by human volunteers.

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

Who are the annotators?

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

CC BY-SA 4.0 License

Citation Information

@inproceedings{dumitrescu2021liro,
  title={Liro: Benchmark and leaderboard for romanian language tasks},
  author={Dumitrescu, Stefan Daniel and Rebeja, Petru and Lorincz, Beata and Gaman, Mihaela and Avram, Andrei and Ilie, Mihai and Pruteanu, Andrei and Stan, Adriana and Rosia, Lorena and Iacobescu, Cristina and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year={2021}
}

Contributions

Thanks to @lorinczb for adding this dataset.

Author:

Anonymous

Dataset size:

16.03 KB