Dataset:

qanastek/ELRC-Medical-V2

Tasks:

Translation

Multilinguality:

multilingual

Size:

100K<n<1M

Language creators:

found

Source datasets:

extended

ELRC-Medical-V2 : European parallel corpus for healthcare machine translation

Dataset Summary

ELRC-Medical-V2 is a parallel corpus for neural machine translation funded by the European Commission and coordinated by the German Research Center for Artificial Intelligence.

Supported Tasks and Leaderboards

translation : The dataset can be used to train a model for translation.

Languages

The corpus consists of source and target sentence pairs for 23 languages of the European Union (EU). The source language is always English (EN).

List of languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), Spanish (es), Estonian (et), Finnish (fi), French (fr), Irish (ga), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).
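The codes above can be combined with the English source into pair names such as `en-es`, the form that appears in the `lang` column and as the configuration name in this card's loading snippet. The following is a minimal sketch that assumes every pair follows the same `en-xx` pattern; the helper name `pair_names` is hypothetical and not part of the dataset.

```python
# Target language codes as listed in this card (English is always the source).
TARGET_LANGS = [
    "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "ga", "hr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def pair_names(source="en", targets=TARGET_LANGS):
    """Build pair names such as 'en-es' (assumed naming pattern)."""
    return [f"{source}-{target}" for target in targets]


pairs = pair_names()
print(len(pairs), pairs[:3])
```

Such a list could be used to loop over language pairs when loading them one at a time.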

Load the dataset with HuggingFace

from datasets import load_dataset

NAME = "qanastek/ELRC-Medical-V2"

dataset = load_dataset(NAME, use_auth_token=True)
print(dataset)

# 90/10 split of the English-Spanish pair; the two slices must not overlap
dataset_train = load_dataset(NAME, "en-es", split='train[:90%]')
dataset_test = load_dataset(NAME, "en-es", split='train[90%:]')
print(dataset_train)
print(dataset_train[0])
print(dataset_test)

Dataset Structure

Data Instances

id,lang,source_text,target_text
1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
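The sample rows use standard CSV quoting: fields containing commas or quotes are wrapped in double quotes, and a literal quote is written as `""`. As a rough illustration, the sketch below parses two of the rows above with Python's standard `csv` module. It only shows how the quoting resolves and is not necessarily how the dataset loader reads the files.

```python
import csv
import io

# Header plus two rows copied from the sample above.
# A raw string keeps the backslashes of the "TOC \o ..." field intact.
SAMPLE = r'''id,lang,source_text,target_text
1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
'''

# DictReader maps each row to the column names of the header line.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row["id"], row["lang"], row["source_text"])
```

Note that all values come back as strings, so `id` still has to be converted to an integer by the caller.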

Data Fields

id: The document identifier, of type Integer.

lang: The pair of source and target languages, of type String.

source_text: The source text, of type String.

target_text: The target text, of type String.
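The four fields can be mirrored by a small typed record. The sketch below is one possible representation, not part of the dataset's tooling: the class `Record`, the property `language_pair`, and the helper `record_from_row` are hypothetical names, and splitting `lang` on the hyphen assumes values of the form `en-bg` as in the sample rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    id: int            # document identifier
    lang: str          # language pair, e.g. "en-bg"
    source_text: str   # English source sentence
    target_text: str   # translation in the target language

    @property
    def language_pair(self):
        """Split 'en-bg' into ('en', 'bg'); assumes a single hyphen separator."""
        source, target = self.lang.split("-", 1)
        return source, target


def record_from_row(row):
    """Convert a mapping of strings (e.g. a parsed CSV row) into a Record."""
    return Record(
        id=int(row["id"]),
        lang=row["lang"],
        source_text=row["source_text"],
        target_text=row["target_text"],
    )


example = record_from_row({
    "id": "2",
    "lang": "en-bg",
    "source_text": "The international humanitarian law and its principles are often not respected.",
    "target_text": "Международното хуманитарно право и неговите принципи често не се зачитат.",
})
print(example.id, example.language_pair)
```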

Data Splits

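| Lang | # Docs | Avg. # Source Tokens | Avg. # Target Tokens |
|-------|---------|-------|-------|
| bg | 13,149 | 23 | 24 |
| cs | 13,160 | 23 | 21 |
| da | 13,242 | 23 | 22 |
| de | 13,291 | 23 | 22 |
| el | 13,091 | 23 | 26 |
| es | 13,195 | 23 | 28 |
| et | 13,016 | 23 | 17 |
| fi | 12,942 | 23 | 16 |
| fr | 13,149 | 23 | 28 |
| ga | 412 | 12 | 12 |
| hr | 12,836 | 23 | 21 |
| hu | 13,025 | 23 | 21 |
| it | 13,059 | 23 | 25 |
| lt | 12,580 | 23 | 18 |
| lv | 13,044 | 23 | 19 |
| mt | 3,093 | 16 | 14 |
| nl | 13,191 | 23 | 25 |
| pl | 12,761 | 23 | 22 |
| pt | 13,148 | 23 | 26 |
| ro | 13,163 | 23 | 25 |
| sk | 12,926 | 23 | 20 |
| sl | 13,208 | 23 | 21 |
| sv | 13,099 | 23 | 21 |
| Total | 277,780 | 22.21 | 21.47 |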
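The Total row can be reproduced from the per-language rows. The sketch below suggests that the document count is a plain sum and that the two token averages are unweighted means over the 23 language pairs (not weighted by document count), matching the reported 22.21 and 21.47 up to truncation at two decimals. This reading is inferred from the numbers and is not stated by the card.

```python
# (language, number of documents, avg. source tokens, avg. target tokens),
# copied from the table above.
STATS = [
    ("bg", 13149, 23, 24), ("cs", 13160, 23, 21), ("da", 13242, 23, 22),
    ("de", 13291, 23, 22), ("el", 13091, 23, 26), ("es", 13195, 23, 28),
    ("et", 13016, 23, 17), ("fi", 12942, 23, 16), ("fr", 13149, 23, 28),
    ("ga", 412, 12, 12), ("hr", 12836, 23, 21), ("hu", 13025, 23, 21),
    ("it", 13059, 23, 25), ("lt", 12580, 23, 18), ("lv", 13044, 23, 19),
    ("mt", 3093, 16, 14), ("nl", 13191, 23, 25), ("pl", 12761, 23, 22),
    ("pt", 13148, 23, 26), ("ro", 13163, 23, 25), ("sk", 12926, 23, 20),
    ("sl", 13208, 23, 21), ("sv", 13099, 23, 21),
]

# Total number of documents across all language pairs.
total_docs = sum(docs for _, docs, _, _ in STATS)

# Unweighted means: every language pair counts equally.
mean_source = sum(src for _, _, src, _ in STATS) / len(STATS)
mean_target = sum(tgt for _, _, _, tgt in STATS) / len(STATS)

print(total_docs, f"{mean_source:.3f}", f"{mean_target:.3f}")
```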

Dataset Creation

Curation Rationale

For details, check the corresponding pages.

Source Data

Initial Data Collection and Normalization

The acquisition of bilingual data (from multilingual websites), normalization, cleaning, deduplication and identification of parallel documents were performed with the ILSP-FC tool. The Maligna aligner was used to align segments. Segment pairs were then merged and filtered.

Who are the source language producers?

All data in this corpus was uploaded to ELRC-Share by Vassilis Papavassiliou.

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.

Additional Information

Dataset Curators

ELRC-Medical-V2: Labrak Yanis, Dufour Richard

Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-XX) Corpus: Vassilis Papavassiliou and others.

Licensing Information

This work is licensed under an Attribution 4.0 International (CC BY 4.0) License.

Citation Information

Please cite the following paper when using this dataset.

@inproceedings{losch-etal-2018-european,
    title = "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management",
    author = {
      L{\"o}sch, Andrea  and
      Mapelli, Valérie  and
      Piperidis, Stelios  and
      Vasiljevs, Andrejs  and
      Smal, Lilli  and
      Declerck, Thierry  and
      Schnur, Eileen  and
      Choukri, Khalid  and
      van Genabith, Josef
    },
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1213",
}