Dataset:

qanastek/ECDC

Tasks:

translation

Languages:

en

Multilinguality:

en-sv en-pl en-hu

Size:

100K<n<1M

Language creators:

found

Source datasets:

extended

License:

other

ECDC : An overview of the European Union's highly multilingual parallel corpora

Dataset Summary

In October 2012, the European Union (EU) agency 'European Centre for Disease Prevention and Control' (ECDC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-five languages. The data is distributed via the web pages of the EC's Joint Research Centre (JRC).

Supported Tasks and Leaderboards

translation : The dataset can be used to train a model for translation.

Languages

In our case, the corpus consists of pairs of source and target sentences: English is the source language, and each of the other 24 languages listed below is a target language.

List of languages: English (en), Swedish (sv), Polish (pl), Hungarian (hu), Lithuanian (lt), Latvian (lv), German (de), Finnish (fi), Slovak (sk), Slovenian (sl), French (fr), Czech (cs), Danish (da), Italian (it), Maltese (mt), Dutch (nl), Portuguese (pt), Romanian (ro), Spanish (es), Estonian (et), Bulgarian (bg), Greek (el), Irish (ga), Icelandic (is) and Norwegian (no).

Load the dataset with HuggingFace

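from datasets import load_dataset
# "en-it" selects the English-Italian configuration.
# download_mode='force_redownload' bypasses the local cache; drop it to reuse a cached copy.
dataset = load_dataset("qanastek/ECDC", "en-it", split='train', download_mode='force_redownload')
print(dataset)     # summary: column names and number of rows
print(dataset[0])  # first record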

Dataset Structure

Data Instances

key,lang,source_text,target_text
doc_0,en-bg,Vaccination against hepatitis C is not yet available.,Засега няма ваксина срещу хепатит С.
doc_1355,en-bg,Varicella infection,Инфекция с варицела
doc_2349,en-bg,"If you have any questions about the processing of your e-mail and related personal data, do not hesitate to include them in your message.","Ако имате въпроси относно обработката на вашия адрес на електронна поща и свързаните лични данни, не се колебайте да ги включите в съобщението си."
doc_192,en-bg,Transmission can be reduced especially by improving hygiene in food production handling.,Предаването на инфекцията може да бъде ограничено особено чрез подобряване на хигиената при манипулациите в хранителната индустрия.

Data Fields

key : The document identifier, of type String.

lang : The source and target language pair, of type String.

source_text : The source text, of type String.

target_text : The target text, of type String.
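To make the four fields concrete, here is a minimal sketch that parses two of the sample rows shown under Data Instances with Python's standard csv module. It assumes the records are available as CSV text in the form shown above (whether the hosted files are actually stored as CSV is not stated here), and the names parse_rows, src_lang and tgt_lang are hypothetical helpers introduced only for this illustration.

```python
import csv
import io

# Two rows copied from the Data Instances sample; the second row has
# commas inside quoted fields, which the csv module handles for us.
SAMPLE = '''key,lang,source_text,target_text
doc_0,en-bg,Vaccination against hepatitis C is not yet available.,Засега няма ваксина срещу хепатит С.
doc_2349,en-bg,"If you have any questions about the processing of your e-mail and related personal data, do not hesitate to include them in your message.","Ако имате въпроси относно обработката на вашия адрес на електронна поща и свързаните лични данни, не се колебайте да ги включите в съобщението си."
'''


def parse_rows(text):
    """Yield one dict per record, with the `lang` pair split into two codes.

    `src_lang` and `tgt_lang` are illustrative names, not dataset fields.
    """
    for row in csv.DictReader(io.StringIO(text)):
        src_lang, tgt_lang = row["lang"].split("-")
        yield {**row, "src_lang": src_lang, "tgt_lang": tgt_lang}


rows = list(parse_rows(SAMPLE))
for r in rows:
    print(r["key"], r["src_lang"], "->", r["tgt_lang"], "|", r["source_text"])
```

Splitting lang on the hyphen relies on every pair having the form shown in the sample (two language codes joined by a single "-").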

Data Splits

lang number of keys (rows)
en-bg 2567
en-cs 2562
en-da 2577
en-de 2560
en-el 2530
en-es 2564
en-et 2581
en-fi 2617
en-fr 2561
en-ga 1356
en-hu 2571
en-is 2511
en-it 2534
en-lt 2545
en-lv 2542
en-mt 2539
en-nl 2510
en-no 2537
en-pl 2546
en-pt 2531
en-ro 2555
en-sk 2525
en-sl 2545
en-sv 2527
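The per-pair counts above can be tallied with a few lines of Python. This is only a sketch that restates the table as a dictionary (the name COUNTS is an assumption made for the example); it sums the rows and finds the smallest pair, which may be useful when deciding how to balance pairs during training.

```python
# Row counts per language pair, copied from the table above.
COUNTS = {
    "en-bg": 2567, "en-cs": 2562, "en-da": 2577, "en-de": 2560,
    "en-el": 2530, "en-es": 2564, "en-et": 2581, "en-fi": 2617,
    "en-fr": 2561, "en-ga": 1356, "en-hu": 2571, "en-is": 2511,
    "en-it": 2534, "en-lt": 2545, "en-lv": 2542, "en-mt": 2539,
    "en-nl": 2510, "en-no": 2537, "en-pl": 2546, "en-pt": 2531,
    "en-ro": 2555, "en-sk": 2525, "en-sl": 2545, "en-sv": 2527,
}

total = sum(COUNTS.values())             # rows across all pairs
smallest = min(COUNTS, key=COUNTS.get)   # pair with the fewest rows
largest = max(COUNTS, key=COUNTS.get)    # pair with the most rows

print("pairs:", len(COUNTS))
print("total rows:", total)
print("smallest:", smallest, COUNTS[smallest])
print("largest:", largest, COUNTS[largest])
```

In this table the English-Irish pair (en-ga) has roughly half as many rows as any other pair, so a model trained on all pairs together will see noticeably less Irish data unless the pairs are reweighted.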

Dataset Creation

Curation Rationale

For details, see the corresponding pages.

Source Data

Who are the source language producers?

All the data in this corpus has been uploaded to the JRC website.

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.

Additional Information

Dataset Curators

Hugging Face ECDC : Labrak Yanis, Dufour Richard (Not affiliated with the original corpus)

An overview of the European Union's highly multilingual parallel corpora : Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro.

Licensing Information

By downloading or using the ECDC-Translation Memory, you are bound by the ECDC-TM usage conditions (PDF).

No Warranty

Each Work is provided ‘as is’ without, to the full extent permitted by law, representations, warranties, obligations and liabilities of any kind, either express or implied, including, but not limited to, any implied warranty of merchantability, integration, satisfactory quality and fitness for a particular purpose.

Except in the cases of wilful misconduct or damages directly caused to natural persons, the Owner will not be liable for any incidental, consequential, direct or indirect damages, including, but not limited to, the loss of data, lost profits or any other financial loss arising from the use of, or inability to use, the Work even if the Owner has been notified of the possibility of such loss, damages, claims or costs, or for any claim by any third party. The Owner may be liable under national statutory product liability laws as far as such laws apply to the Work.

Citation Information

Please cite the following paper when using this dataset.

@article{10.1007/s10579-014-9277-0,
  author = {Steinberger, Ralf and Ebrahim, Mohamed and Poulis, Alexandros and Carrasco-Benitez, Manuel and Schl\"{u}ter, Patrick and Przybyszewski, Marek and Gilbro, Signe},
  title = {An Overview of the European Union's Highly Multilingual Parallel Corpora},
  year = {2014},
  issue_date = {December  2014},
  publisher = {Springer-Verlag},
  address = {Berlin, Heidelberg},
  volume = {48},
  number = {4},
  issn = {1574-020X},
  url = {https://doi.org/10.1007/s10579-014-9277-0},
  doi = {10.1007/s10579-014-9277-0},
  abstract = {Starting in 2006, the European Commission's Joint Research Centre and other European Union organisations have made available a number of large-scale highly-multilingual parallel language resources. In this article, we give a comparative overview of these resources and we explain the specific nature of each of them. This article provides answers to a number of question, including: What are these linguistic resources? What is the difference between them? Why were they originally created and why was the data released publicly? What can they be used for and what are the limitations of their usability? What are the text types, subject domains and languages covered? How to avoid overlapping document sets? How do they compare regarding the formatting and the translation alignment? What are their usage conditions? What other types of multilingual linguistic resources does the EU have? This article thus aims to clarify what the similarities and differences between the various resources are and what they can be used for. It will also serve as a reference publication for those resources, for which a more detailed description has been lacking so far (EAC-TM, ECDC-TM and DGT-Acquis).},
  journal = {Lang. Resour. Eval.},
  month = {dec},
  pages = {679–707},
  numpages = {29},
  keywords = {DCEP, EAC-TM, EuroVoc, JRC EuroVoc Indexer JEX, Parallel corpora, DGT-TM, Eur-Lex, Highly multilingual, Linguistic resources, DGT-Acquis, European Union, ECDC-TM, JRC-Acquis, Translation memory}
}