Dataset:

opus_dgt

Task:

Translation

Languages:

Multilinguality:

multilingual

Size:

100K<n<1M 10K<n<100K 1M<n<10M

Language creators:

found

Annotation creators:

found

Source datasets:

original

License:

license:unknown


Dataset Card for OpusDgt

Dataset Summary

A collection of translation memories provided by the Joint Research Centre (JRC) Directorate-General for Translation (DGT): https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

The dataset contains 25 languages and 299 bitexts.

To load a language pair that isn't part of the predefined configs, specify the two language codes as a pair, e.g.

from datasets import load_dataset

dataset = load_dataset("opus_dgt", lang1="it", lang2="pl")

You can find the valid pairs in the Homepage section of the Dataset Description: http://opus.nlpl.eu/DGT.php

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The languages in the dataset are:

Dataset Structure

Data Instances

{
  "id": "0",
  "translation": {
    "bg": "Протокол за поправка на Конвенцията относно компетентността, признаването и изпълнението на съдебни решения по граждански и търговски дела, подписана в Лугано на 30 октомври 2007 г.",
    "ga": "Miontuairisc cheartaitheach maidir le Coinbhinsiún ar dhlínse agus ar aithint agus ar fhorghníomhú breithiúnas in ábhair shibhialta agus tráchtála, a siníodh in Lugano an 30 Deireadh Fómhair 2007"
  }
}

Data Fields

id (str): Unique identifier of the parallel sentence for the pair of languages.
translation (dict): Parallel sentences for the pair of languages, keyed by language code.
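
As a minimal sketch of how these two fields might be consumed, the snippet below pulls a (source, target) pair out of a record shaped like the instance shown under Data Instances. The helper name `to_pair` is hypothetical and not part of the dataset or any library, and the sample texts are shortened for readability.

```python
from typing import Dict, Tuple

# A record shaped like the instance under "Data Instances".
# The texts are shortened here; real records hold full sentences.
example = {
    "id": "0",
    "translation": {
        "bg": "Протокол за поправка ...",
        "ga": "Miontuairisc cheartaitheach ...",
    },
}


def to_pair(record: Dict, src: str, tgt: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for the given language codes.

    Hypothetical helper for illustration: it assumes `record["translation"]`
    is a dict keyed by language code, as described under "Data Fields".
    """
    translation = record["translation"]
    return translation[src], translation[tgt]


source_text, target_text = to_pair(example, "bg", "ga")
print(example["id"], source_text, "->", target_text)
```

Because `translation` is keyed by language code, the same record can serve either translation direction by swapping the `src` and `tgt` arguments.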

Data Splits

The dataset contains a single train split.
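
Since only a train split is provided, users who need held-out data have to create it themselves. The following is a rough sketch of one way to do that on a plain list of records using only the Python standard library. The function name `holdout_split`, the 25% fraction, the seed, and the toy records are all illustrative assumptions, not something the dataset prescribes.

```python
import random
from typing import Dict, List, Sequence, Tuple


def holdout_split(
    records: Sequence[Dict], valid_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Split records into (train, valid) with a seeded random hold-out.

    Hypothetical helper for illustration; the original order of the
    records is preserved within each part.
    """
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_valid = int(len(records) * valid_fraction)
    valid_idx = set(indices[:n_valid])
    train = [r for i, r in enumerate(records) if i not in valid_idx]
    valid = [r for i, r in enumerate(records) if i in valid_idx]
    return train, valid


# Toy records in the same shape as the dataset's instances (made-up text).
records = [
    {"id": str(i), "translation": {"it": f"frase {i}", "pl": f"zdanie {i}"}}
    for i in range(20)
]
train, valid = holdout_split(records, valid_fraction=0.25, seed=0)
print(len(train), len(valid))
```

When working with the `datasets` library directly, its `Dataset.train_test_split` method should serve the same purpose without a custom helper.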

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}

Contributions

Thanks to @rkc007 for adding this dataset.

Author:

Anonymous

Dataset size:

30.78 KB