Dataset: europa_eac_tm
Tasks: translation
Multilinguality: translation
Size: 1K<n<10K
Language creators: expert-generated
Annotations creators: expert-generated
Source datasets: original
License: cc-by-4.0

This dataset is a corpus of manually produced translations from English into up to 25 languages, released in 2012 by the European Union's Directorate General for Education and Culture (EAC).
To load a language pair that is not part of the predefined configurations, pass the two language codes as the language_pair argument. For example, to load the Czech-Greek pair:
from datasets import load_dataset
dataset = load_dataset("europa_eac_tm", language_pair=("cs", "el"))
The sentences in this dataset were originally written in English (the source language) and then translated into the other languages. They are extracted from electronic forms: application and report forms for decentralised actions of EAC's Life-long Learning Programme (LLP) and the Youth in Action Programme. The contents of the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms' Data) and (b) checkboxes (referred to as 'Reference Data').
The dataset contains translations of English sentences or sentence fragments into Bulgarian, Czech, Danish, Dutch, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.
Language codes:
{
  "translation": {
    "en": "Sentence to translate",
    "<target_language>": "Phrase à traduire"
  },
  "sentence_type": 0
}
translation : Mapping of sentences to translate (in English) and translated sentences.
sentence_type : Integer value: 0 if the sentence is 'form data' (extracted from the labels and contents of drop-down menus of the source electronic forms), or 1 if it is 'reference data' (extracted from the checkboxes of the electronic forms).
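As a minimal sketch of how these two fields might be consumed, the snippet below groups (source, target) sentence pairs by `sentence_type`. It runs on hand-written records that only mimic the structure described above: the sample sentences, the French target, the helper name `pairs_by_type` and the labels in `SENTENCE_TYPE_NAMES` are all illustrative assumptions, not part of the dataset or its loader.

```python
from collections import defaultdict

# Hypothetical records that mimic the documented structure; not real dataset rows.
SAMPLE = [
    {"translation": {"en": "Country", "fr": "Pays"}, "sentence_type": 0},
    {"translation": {"en": "Germany", "fr": "Allemagne"}, "sentence_type": 1},
    {"translation": {"en": "Education and Culture", "fr": "Éducation et culture"}, "sentence_type": 1},
]

# Readable labels for the two integer codes (label names are an assumption).
SENTENCE_TYPE_NAMES = {0: "form_data", 1: "reference_data"}


def pairs_by_type(records, source="en", target="fr"):
    """Group (source, target) sentence pairs by their sentence type."""
    grouped = defaultdict(list)
    for record in records:
        translation = record["translation"]
        # Skip records that lack either side of the requested pair.
        if source not in translation or target not in translation:
            continue
        name = SENTENCE_TYPE_NAMES[record["sentence_type"]]
        grouped[name].append((translation[source], translation[target]))
    return dict(grouped)


grouped = pairs_by_type(SAMPLE)
print(grouped["form_data"])
print(grouped["reference_data"])
```

Keeping the two groups apart mirrors the way the original collections are kept separate, which can matter because reference data tends to consist of short terms while form data contains labels and longer prompts.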
The data is not split: only the train split is available.
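Because only a train split is provided, any held-out set has to be created by the user. The following is a minimal, library-independent sketch of a reproducible split; the function name `split_records`, the 10% test fraction and the seed are arbitrary assumptions, and the toy records stand in for real rows. With the `datasets` library, `Dataset.train_test_split` serves the same purpose.

```python
import random


def split_records(records, test_fraction=0.1, seed=42):
    """Return (train, test) lists using a seeded, reproducible shuffle."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(records)))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(records) * test_fraction))
    test_indices = set(indices[:n_test])
    # Preserve the original record order inside each split.
    train = [r for i, r in enumerate(records) if i not in test_indices]
    test = [r for i, r in enumerate(records) if i in test_indices]
    return train, test


# Toy stand-in for dataset rows (hypothetical data).
records = [{"id": i} for i in range(20)]
train, test = split_records(records)
print(len(train), len(test))
```

Since many form labels recur across the source documents, it may be worth deduplicating sentence pairs before splitting so that identical pairs do not end up on both sides.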
The EAC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of education and culture. Also, it includes translation units for the languages Croatian (HR), Icelandic (IS), Norwegian (Bokmål, NB or Norwegian, NO) and Turkish (TR).
EAC-TM was built in the context of translating electronic forms: application and report forms for decentralised actions of EAC's Life-long Learning Programme (LLP) and the Youth in Action Programme. All documents and sentences were originally written in English (source language is English) and then translated into the other languages.
The contents in the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms' Data) and (b) checkboxes (referred to as 'Reference Data'). Due to the different types of data, the two collections are kept separate. For example, labels can be 'Country', 'Please specify your home country' etc., while examples for reference data are 'Germany', 'Basic/general programmes', 'Education and Culture' etc.
The data consists of translations carried out between the end of the year 2008 and July 2012.
Who are the source language producers?
The texts were translated by staff of the National Agencies of the Lifelong Learning and Youth in Action programmes. They are typically professionals in the field of education/youth and EU programmes. They are thus not professional translators, but they are normally native speakers of the target language.
Sentences were manually translated by humans.
Who are the annotators?
The texts were translated by staff of the National Agencies of the Lifelong Learning and Youth in Action programmes. They are typically professionals in the field of education/youth and EU programmes. They are thus not professional translators, but they are normally native speakers of the target language.
© European Union, 1995-2020
The Commission's reuse policy is implemented by the Commission Decision of 12 December 2011 on the reuse of Commission documents.
Unless otherwise indicated (e.g. in individual copyright notices), content owned by the EU on this website is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence. This means that reuse is allowed, provided appropriate credit is given and changes are indicated.
You may be required to clear additional rights if a specific content depicts identifiable private individuals or includes third-party works. To use or reproduce content that is not owned by the EU, you may need to seek permission directly from the rightholders. Software or documents covered by industrial property rights, such as patents, trade marks, registered designs, logos and names, are excluded from the Commission's reuse policy and are not licensed to you.
@Article{Steinberger2014,
  author={Steinberger, Ralf and Ebrahim, Mohamed and Poulis, Alexandros and Carrasco-Benitez, Manuel and Schl{\"u}ter, Patrick and Przybyszewski, Marek and Gilbro, Signe},
  title={An overview of the European Union's highly multilingual parallel corpora},
  journal={Language Resources and Evaluation},
  year={2014},
  month={Dec},
  day={01},
  volume={48},
  number={4},
  pages={679-707},
  issn={1574-0218},
  doi={10.1007/s10579-014-9277-0},
  url={https://doi.org/10.1007/s10579-014-9277-0}
}
Thanks to @SBrandeis for adding this dataset.