Dataset:

europa_eac_tm

Tasks:

Translation

Languages:

Multilinguality:

translation

Size:

1K<n<10K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

License:

cc-by-4.0

Dataset Card for Europa Education and Culture Translation Memory (EAC-TM)

Dataset Summary

This dataset is a corpus of manually produced translations from English into up to 25 languages, released in 2012 by the European Union's Directorate General for Education and Culture (EAC).

To load a language pair that is not part of the predefined configurations, pass the two language codes as the language pair. For example, to load Czech-Greek sentence pairs:

from datasets import load_dataset
dataset = load_dataset("europa_eac_tm", language_pair=("cs", "el"))

Supported Tasks and Leaderboards

text2text-generation: the dataset can be used to train a model for machine translation. Machine translation models are usually evaluated using metrics such as BLEU, ROUGE or SacreBLEU. You can use the mBART model for this task. This task has active leaderboards, which can be found at https://paperswithcode.com/task/machine-translation and which usually rank models by BLEU score.

Languages

The sentences in this dataset were originally written in English (source language is English) and then translated into the other languages. The sentences are extracted from electronic forms: application and report forms for decentralised actions of EAC's Life-long Learning Programme (LLP) and the Youth in Action Programme. The contents in the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms' Data) and (b) checkboxes (referred to as 'Reference Data').

The dataset contains translations of English sentences or parts of sentences into Bulgarian, Czech, Danish, Dutch, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.

Language codes:

Dataset Structure

Data Instances

{
  "translation": {
    "en": "Sentence to translate",
    "<target_language>": "Phrase à traduire"
  },
  "sentence_type": 0
}

Data Fields

translation: Mapping from language code to text, holding the sentence to translate (in English) and its translation.
sentence_type: Integer value, 0 if the sentence is 'form data' (extracted from the labels and contents of drop-down menus of the source electronic forms) or 1 if the sentence is 'reference data' (extracted from the checkboxes of the electronic forms).
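As a minimal sketch of how a record with these two fields could be unpacked, the Python snippet below works on a hand-written sample record rather than on the real data. The helper name `unpack_record` and the label strings are illustrative assumptions; they are not part of the dataset or of the `datasets` library.

```python
# Hypothetical helper and label names, chosen for illustration only.
SENTENCE_TYPE_LABELS = {0: "form_data", 1: "reference_data"}


def unpack_record(record, source_language, target_language):
    """Return (source_text, target_text, type_label) for one record."""
    translation = record["translation"]
    label = SENTENCE_TYPE_LABELS[record["sentence_type"]]
    return translation[source_language], translation[target_language], label


# Hand-written sample that mirrors the structure shown under Data Instances.
sample = {
    "translation": {"en": "Sentence to translate", "fr": "Phrase à traduire"},
    "sentence_type": 0,
}

source, target, kind = unpack_record(sample, "en", "fr")
print(source, "->", target, f"({kind})")
```

Passing the source language explicitly keeps the helper usable for pairs that do not involve English, such as the Czech-Greek pair mentioned earlier.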

Data Splits

The data is not split; only the train split is available.
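Because only a train split is provided, anyone who wants a held-out set has to create one. The following is a minimal sketch in plain Python, assuming the records have already been read into a list; the function name `split_records`, the default fraction and the seed are illustrative choices, not something prescribed by the dataset.

```python
import random


def split_records(records, test_fraction=0.1, seed=0):
    """Shuffle a copy of `records` and cut it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Placeholder records standing in for real dataset rows.
records = [{"id": i} for i in range(20)]
train, test = split_records(records, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

When working with the `datasets` library directly, its `Dataset.train_test_split` method offers similar behaviour without the manual shuffling.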

Dataset Creation

Curation Rationale

The EAC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of education and culture. Also, it includes translation units for the languages Croatian (HR), Icelandic (IS), Norwegian (Bokmål, NB or Norwegian, NO) and Turkish (TR).

Source Data

Initial Data Collection and Normalization

EAC-TM was built in the context of translating electronic forms: application and report forms for decentralised actions of EAC's Life-long Learning Programme (LLP) and the Youth in Action Programme. All documents and sentences were originally written in English (source language is English) and then translated into the other languages.

The contents in the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms' Data) and (b) checkboxes (referred to as 'Reference Data'). Due to the different types of data, the two collections are kept separate. For example, labels can be 'Country', 'Please specify your home country' etc., while examples for reference data are 'Germany', 'Basic/general programmes', 'Education and Culture' etc.
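To illustrate how the two collections could be kept apart when processing records, here is a small sketch that groups English source strings by their `sentence_type` value (0 for 'Forms' data, 1 for 'Reference Data', following the Data Fields description). The function name `group_by_sentence_type` is hypothetical, and the sample records are hand-written from the examples quoted above.

```python
from collections import defaultdict


def group_by_sentence_type(records):
    """Map each sentence_type value to the list of English source strings."""
    groups = defaultdict(list)
    for record in records:
        groups[record["sentence_type"]].append(record["translation"]["en"])
    return dict(groups)


# Hand-written samples: 0 marks 'Forms' data, 1 marks 'Reference Data'.
records = [
    {"translation": {"en": "Country"}, "sentence_type": 0},
    {"translation": {"en": "Germany"}, "sentence_type": 1},
    {"translation": {"en": "Please specify your home country"}, "sentence_type": 0},
    {"translation": {"en": "Education and Culture"}, "sentence_type": 1},
]

grouped = group_by_sentence_type(records)
print(grouped[0])
print(grouped[1])
```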

The data consists of translations carried out between the end of the year 2008 and July 2012.

Who are the source language producers?

The texts were translated by staff of the National Agencies of the Lifelong Learning and Youth in Action programmes. They are typically professionals in the field of education/youth and EU programmes. They are thus not professional translators, but they are normally native speakers of the target language.

Annotations

Annotation process

Sentences were manually translated by humans.

Who are the annotators?

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The Commission's reuse policy is implemented by the Commission Decision of 12 December 2011 on the reuse of Commission documents.

Unless otherwise indicated (e.g. in individual copyright notices), content owned by the EU on this website is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence. This means that reuse is allowed, provided appropriate credit is given and changes are indicated.

You may be required to clear additional rights if a specific content depicts identifiable private individuals or includes third-party works. To use or reproduce content that is not owned by the EU, you may need to seek permission directly from the rightholders. Software or documents covered by industrial property rights, such as patents, trade marks, registered designs, logos and names, are excluded from the Commission's reuse policy and are not licensed to you.

Citation Information

@Article{Steinberger2014,
        author={Steinberger, Ralf
                and Ebrahim, Mohamed
                and Poulis, Alexandros
                and Carrasco-Benitez, Manuel
                and Schl{\"u}ter, Patrick
                and Przybyszewski, Marek
                and Gilbro, Signe},
        title={An overview of the European Union's highly multilingual parallel corpora},
        journal={Language Resources and Evaluation},
        year={2014},
        month={Dec},
        day={01},
        volume={48},
        number={4},
        pages={679-707},
        issn={1574-0218},
        doi={10.1007/s10579-014-9277-0},
        url={https://doi.org/10.1007/s10579-014-9277-0}
}

Contributions

Thanks to @SBrandeis for adding this dataset.

Author:

Anonymous

Dataset size:

99.5 KB