Dataset:

qanastek/ELRC-Medical-V2

Task:

Translation

Languages:

Multilinguality:

multilingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

machine-generated expert-generated

Source datasets:

extended

ELRC-Medical-V2 : European parallel corpus for healthcare machine translation

Dataset Summary

ELRC-Medical-V2 is a parallel corpus for neural machine translation funded by the European Commission and coordinated by the German Research Center for Artificial Intelligence.

Supported Tasks and Leaderboards

translation: The dataset can be used to train a model for translation.

Languages

The corpus consists of source and target sentence pairs for 23 languages of the European Union (EU), with English (EN) as the source language in every pair.

List of languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), Spanish (es), Estonian (et), Finnish (fi), French (fr), Irish (ga), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).

Load the dataset with HuggingFace

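from datasets import load_dataset

NAME = "qanastek/ELRC-Medical-V2"

# Load the dataset (use_auth_token=True passes your stored Hugging Face credentials)
dataset = load_dataset(NAME, use_auth_token=True)
print(dataset)

# Use the first 90% of the English-Spanish pairs for training and the last 10% for testing
dataset_train = load_dataset(NAME, "en-es", split='train[:90%]')
dataset_test = load_dataset(NAME, "en-es", split='train[90%:]')
print(dataset_train)
print(dataset_train[0])
print(dataset_test)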
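
The `en-es` configuration used above suggests that each language pair is exposed under an `en-xx` name. As a minimal sketch, assuming that naming holds for every language in the list (an assumption, not something this card states), the configuration names can be generated from the language codes. `LANG_CODES`, `config_name`, `load_pair`, and `ALL_CONFIGS` are illustrative names, not part of the dataset.

```python
# Target language codes as listed in the "Languages" section of this card.
LANG_CODES = [
    "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "ga", "hr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def config_name(target: str, source: str = "en") -> str:
    """Build a configuration name such as "en-es" (assumed naming pattern)."""
    if target not in LANG_CODES:
        raise ValueError(f"unknown target language code: {target!r}")
    return f"{source}-{target}"


def load_pair(target: str, split: str = "train"):
    """Load one language pair; needs the `datasets` package and network access."""
    # Imported lazily so the helpers above can be used offline.
    from datasets import load_dataset

    return load_dataset("qanastek/ELRC-Medical-V2", config_name(target), split=split)


ALL_CONFIGS = [config_name(code) for code in LANG_CODES]
print(len(ALL_CONFIGS), ALL_CONFIGS[:3])
```

Calling `load_pair("fr")` would then fetch the English-French pairs, provided the assumed `en-fr` configuration exists.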

Dataset Structure

Data Instances

id,lang,source_text,target_text
1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
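
The sample rows follow standard CSV quoting: a field containing a comma is wrapped in double quotes, and a literal double quote inside a field is written as `""`. Here is a minimal sketch of reading such rows with Python's `csv` module, assuming the data is available as plain CSV text in this layout. The `SAMPLE` string simply repeats the rows shown above.

```python
import csv
import io

# Raw string so the backslashes in the first row are kept verbatim.
SAMPLE = r'''id,lang,source_text,target_text
1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
'''

# DictReader uses the header line as keys; every value is returned as a string.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row["id"], row["lang"], row["source_text"])
```

The first row also shows that some segments carry document-formatting residue (a table-of-contents field code), which may be worth filtering before training.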

Data Fields

id: The document identifier, of type Integer.

lang: The source and target language pair, of type String.

source_text: The source text, of type String.

target_text: The target text, of type String.
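
The four fields map naturally onto a small typed record. The sketch below is one possible representation. The `Record` class and the `from_row` helper are hypothetical names, and splitting `lang` on the hyphen assumes every pair is written as `source-target`, as in the `en-bg` sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    id: int
    lang: str  # language pair, e.g. "en-bg"
    source_text: str
    target_text: str

    @property
    def source_lang(self) -> str:
        # Part before the first hyphen (assumed "source-target" layout).
        return self.lang.split("-", 1)[0]

    @property
    def target_lang(self) -> str:
        # Part after the first hyphen (assumed "source-target" layout).
        return self.lang.split("-", 1)[1]


def from_row(row: dict) -> Record:
    """Convert a mapping with string values (e.g. a CSV row) into a Record."""
    return Record(
        id=int(row["id"]),
        lang=row["lang"],
        source_text=row["source_text"],
        target_text=row["target_text"],
    )


example = from_row({
    "id": "3",
    "lang": "en-bg",
    "source_text": "At policy level, progress was made on several important initiatives.",
    "target_text": "На равнище политики напредък е постигнат по няколко важни инициативи.",
})
print(example.id, example.source_lang, example.target_lang)
```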

Data Splits

Lang	# Docs	Avg. # Source Tokens	Avg. # Target Tokens
bg	13 149	23	24
cs	13 160	23	21
da	13 242	23	22
de	13 291	23	22
el	13 091	23	26
es	13 195	23	28
et	13 016	23	17
fi	12 942	23	16
fr	13 149	23	28
ga	412	12	12
hr	12 836	23	21
hu	13 025	23	21
it	13 059	23	25
lt	12 580	23	18
lv	13 044	23	19
mt	3 093	16	14
nl	13 191	23	25
pl	12 761	23	22
pt	13 148	23	26
ro	13 163	23	25
sk	12 926	23	20
sl	13 208	23	21
sv	13 099	23	21
Total	277 780	22.21	21.47
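
As a quick consistency check on the table, the per-language document counts can be summed and compared with the Total row. The sketch also computes the unweighted mean of the per-language token averages. That the Total row reports such an unweighted mean is an assumption inferred from the numbers, not something this card states.

```python
# (language, documents, avg. source tokens, avg. target tokens), copied from the table.
STATS = [
    ("bg", 13149, 23, 24), ("cs", 13160, 23, 21), ("da", 13242, 23, 22),
    ("de", 13291, 23, 22), ("el", 13091, 23, 26), ("es", 13195, 23, 28),
    ("et", 13016, 23, 17), ("fi", 12942, 23, 16), ("fr", 13149, 23, 28),
    ("ga", 412, 12, 12), ("hr", 12836, 23, 21), ("hu", 13025, 23, 21),
    ("it", 13059, 23, 25), ("lt", 12580, 23, 18), ("lv", 13044, 23, 19),
    ("mt", 3093, 16, 14), ("nl", 13191, 23, 25), ("pl", 12761, 23, 22),
    ("pt", 13148, 23, 26), ("ro", 13163, 23, 25), ("sk", 12926, 23, 20),
    ("sl", 13208, 23, 21), ("sv", 13099, 23, 21),
]

# Sum of the "# Docs" column.
total_docs = sum(docs for _, docs, _, _ in STATS)

# Unweighted means: each language counts once, regardless of its size.
mean_src = sum(src for _, _, src, _ in STATS) / len(STATS)
mean_tgt = sum(tgt for _, _, _, tgt in STATS) / len(STATS)

print(total_docs)
print(f"{mean_src:.3f} {mean_tgt:.3f}")
```

With these numbers the sum is 277 780, matching the Total row, and the unweighted means come out at about 22.22 and 21.48, close to the 22.21 and 21.47 reported there. Irish (ga) and Maltese (mt) have far fewer documents than the other languages, so a document-weighted mean would differ.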

Dataset Creation

Curation Rationale

For details, check the corresponding pages.

Source Data

Initial Data Collection and Normalization

The acquisition of bilingual data (from multilingual websites), normalization, cleaning, deduplication, and identification of parallel documents were carried out with the ILSP-FC tool. The Maligna aligner was used to align segments. Segment pairs were also merged and filtered.
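
To make these cleaning steps more concrete, here is a minimal sketch of whitespace normalization, deduplication, and filtering over segment pairs. It is an illustration only and does not reproduce what ILSP-FC or Maligna do internally. The function names and the rule of dropping empty or identical (likely untranslated) pairs are assumptions, and the sample pairs are made up for the example.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and strip both ends."""
    return " ".join(text.split())


def clean_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield normalized pairs, skipping empty, identical, and duplicate ones."""
    seen = set()
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue  # one side is empty
        if source == target:
            continue  # likely left untranslated
        if (source, target) in seen:
            continue  # exact duplicate after normalization
        seen.add((source, target))
        yield source, target


raw = [
    ("At policy level,  progress was made.", "На равнище политики напредък е постигнат."),
    ("At policy level, progress was made.", "На равнище политики напредък е постигнат."),
    ("Introduction", "Introduction"),
    ("", "Въведение"),
]
cleaned = list(clean_pairs(raw))
print(len(cleaned), cleaned[0][0])
```

Only the first pair survives here: the second is a duplicate once whitespace is normalized, the third is identical on both sides, and the fourth has an empty source.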

Who are the source language producers?

All data in this corpus were uploaded to ELRC-Share by Vassilis Papavassiliou.

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.

Additional Information

Dataset Curators

ELRC-Medical-V2: Labrak Yanis, Dufour Richard

Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-XX) Corpus: Vassilis Papavassiliou and others.

Licensing Information

This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Citation Information

Please cite the following paper when using this dataset.

@inproceedings{losch-etal-2018-european,
    title = {European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management},
    author = {
      Lösch, Andrea  and
      Mapelli, Valérie  and
      Piperidis, Stelios  and
      Vasiljevs, Andrejs  and
      Smal, Lilli  and
      Declerck, Thierry  and
      Schnur, Eileen  and
      Choukri, Khalid  and
      van Genabith, Josef
    },
    booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
    month = may,
    year = {2018},
    address = {Miyazaki, Japan},
    publisher = {European Language Resources Association (ELRA)},
    url = {https://aclanthology.org/L18-1213},
}

Author:

qanastek

Dataset size:

120.7 MB