Dataset:

qanastek/WMT-16-PubMed

Task:

Translation

Languages:

en, es, fr, pt

Multilinguality:

multilingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

machine-generated, expert-generated

Source datasets:

extended

WMT-16-PubMed: Parallel biomedical translation corpus from the WMT'16 Biomedical Translation Task (PubMed)

Dataset Summary

WMT-16-PubMed is a parallel corpus for neural machine translation, collected and aligned for the WMT'16 Shared Task: Biomedical Translation Task, held at ACL 2016.

Supported Tasks and Leaderboards

translation: The dataset can be used to train a translation model.

Languages

The corpus consists of source and target sentence pairs covering 4 languages, with English paired with each of the other three.

List of languages: English (en), Spanish (es), French (fr), Portuguese (pt).

Load the dataset with HuggingFace

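from datasets import load_dataset

# Load the "train" split. force_redownload bypasses the local cache on every call,
# so drop download_mode once a good copy has been cached.
dataset = load_dataset("qanastek/WMT-16-PubMed", split="train", download_mode="force_redownload")
print(dataset)     # column names and number of rows
print(dataset[0])  # first example as a dict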

Dataset Structure

Data Instances

| | lang | doc_id | workshop | publisher | source_text | target_text |
|---|---|---|---|---|---|---|
| 0 | en-fr | 26839447 | WMT'16 Biomedical Translation Task - PubMed | pubmed | Global Health: Where Do Physiotherapy and Reha... | La place des cheveux et des poils dans les rit... |
| 1 | en-fr | 26837117 | WMT'16 Biomedical Translation Task - PubMed | pubmed | Carabin | Les Carabins |
| 2 | en-fr | 26837116 | WMT'16 Biomedical Translation Task - PubMed | pubmed | In Process Citation | Le laboratoire d'Anatomie, Biomécanique et Org... |
| 3 | en-fr | 26837115 | WMT'16 Biomedical Translation Task - PubMed | pubmed | Comment on the misappropriation of bibliograph... | Du détournement des références bibliographique... |
| 4 | en-fr | 26837114 | WMT'16 Biomedical Translation Task - PubMed | pubmed | Anti-aging medicine, a science-based, essentia... | La médecine anti-âge, une médecine scientifiqu... |
| ... | ... | ... | ... | ... | ... | ... |
| 973972 | en-pt | 20274330 | WMT'16 Biomedical Translation Task - PubMed | pubmed | Myocardial infarction, diagnosis and treatment | Infarto do miocárdio; diagnóstico e tratamento |
| 973973 | en-pt | 20274329 | WMT'16 Biomedical Translation Task - PubMed | pubmed | The health areas politics | A política dos campos de saúde |
| 973974 | en-pt | 20274328 | WMT'16 Biomedical Translation Task - PubMed | pubmed | The role in tissue edema and liquid exchanges ... | O papel dos tecidos nos edemas e nas trocas lí... |
| 973975 | en-pt | 20274327 | WMT'16 Biomedical Translation Task - PubMed | pubmed | About suppuration of the wound after thoracopl... | Sôbre as supurações da ferida operatória após ... |
| 973976 | en-pt | 20274326 | WMT'16 Biomedical Translation Task - PubMed | pubmed | Experimental study of liver lesions in the tre... | Estudo experimental das lesões hepáticas no tr... |

Data Fields

lang: The source and target language pair (for example en-fr), of type String.

source_text: The source text, of type String.

target_text: The target text, of type String.
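Many translation pipelines expect each example as a nested dictionary keyed by language code rather than as flat columns. The following is a minimal sketch of that conversion. It assumes that `lang` always has the form `source-target` (as in the `en-fr` and `en-pt` rows shown under Data Instances) and that `source_text` is written in the first language of the pair. `to_translation` is a hypothetical helper name, not part of the dataset or of the `datasets` library.

```python
def to_translation(row):
    """Map one flat row to {"translation": {src_code: text, tgt_code: text}}."""
    # Assumption: `lang` looks like "en-fr", with the source code first.
    src_code, tgt_code = row["lang"].split("-", 1)
    return {
        "translation": {
            src_code: row["source_text"],
            tgt_code: row["target_text"],
        }
    }


# Row taken from the Data Instances sample (index 973973).
example = {
    "lang": "en-pt",
    "source_text": "The health areas politics",
    "target_text": "A política dos campos de saúde",
}
print(to_translation(example))
```

If the assumption about `lang` holds, the same function could be passed to `dataset.map` to convert the whole split.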

Data Splits

en-es: 285,584

en-fr: 614,093

en-pt: 74,300
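The three counts sum to 973,977 rows, which agrees with the last index (973976) in the Data Instances sample. A quick way to check such counts is to tally the `lang` column. The sketch below does this on a small in-memory list so it runs without downloading anything; `count_pairs` is a hypothetical helper name introduced for illustration.

```python
from collections import Counter


def count_pairs(langs):
    """Count examples per language pair from an iterable of `lang` values."""
    return Counter(langs)


# Toy stand-in for the real `lang` column.
sample_langs = ["en-fr", "en-fr", "en-es", "en-pt", "en-fr"]
for pair, n in sorted(count_pairs(sample_langs).items()):
    print(f"{pair}: {n:,}")

# Sanity check on the figures reported in this section.
reported = {"en-es": 285_584, "en-fr": 614_093, "en-pt": 74_300}
print(sum(reported.values()))
```

With the dataset loaded through `load_dataset`, calling `count_pairs(dataset["lang"])` should give the per-pair totals, assuming the hosted copy still matches the numbers listed here.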

Dataset Creation

Curation Rationale

For details, see the corresponding pages.

Source Data

Who are the source language producers?

The shared task was organized by:

Antonio Jimeno Yepes (IBM Research Australia)
Aurélie Névéol (LIMSI, CNRS, France)
Mariana Neves (Hasso-Plattner Institute, Germany)
Karin Verspoor (University of Melbourne, Australia)

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.
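One common way to reduce the impact of noisy pairs is a cheap heuristic filter applied before training. The sketch below is not part of the original corpus tooling. It drops rows with empty text, rows whose text is a known placeholder such as "In Process Citation" (which appears as a source text in the Data Instances sample), and rows with an extreme length ratio. The function name `looks_plausible`, the threshold `max_ratio=3.0`, and the placeholder list are illustrative assumptions, not tuned values.

```python
def looks_plausible(row, max_ratio=3.0, placeholders=("In Process Citation",)):
    """Return True if a row passes crude sanity checks (heuristic, assumed thresholds)."""
    src = row["source_text"].strip()
    tgt = row["target_text"].strip()
    # Empty text on either side cannot be a valid pair.
    if not src or not tgt:
        return False
    # Placeholder titles carry no translatable content.
    if src in placeholders or tgt in placeholders:
        return False
    # Character-length ratio between the longer and the shorter side.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


# Toy rows: the first is from the Data Instances sample, the others are made up.
rows = [
    {"source_text": "Myocardial infarction, diagnosis and treatment",
     "target_text": "Infarto do miocárdio; diagnóstico e tratamento"},
    {"source_text": "In Process Citation", "target_text": "Exemple de titre"},
    {"source_text": "A title", "target_text": ""},
]
print([looks_plausible(r) for r in rows])
```

A filter like this only catches crude cases; it cannot detect a fluent but misaligned pair. On the loaded dataset it could be applied with `dataset.filter(looks_plausible)`.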

Additional Information

Dataset Curators

Hugging Face WMT-16-PubMed: Labrak Yanis, Dufour Richard (not affiliated with the original corpus)

WMT'16 Shared Task: Biomedical Translation Task:

Antonio Jimeno Yepes (IBM Research Australia)
Aurélie Névéol (LIMSI, CNRS, France)
Mariana Neves (Hasso-Plattner Institute, Germany)
Karin Verspoor (University of Melbourne, Australia)

Citation Information

Please cite the following paper when using this dataset.

@inproceedings{bojar-etal-2016-findings,
    title = {Findings of the 2016 Conference on Machine Translation},
    author = {
      Bojar, Ondrej  and
      Chatterjee, Rajen  and
      Federmann, Christian  and
      Graham, Yvette  and
      Haddow, Barry  and
      Huck, Matthias  and
      Jimeno Yepes, Antonio  and
      Koehn, Philipp  and
      Logacheva, Varvara  and
      Monz, Christof  and
      Negri, Matteo  and
      Neveol, Aurelie  and
      Neves, Mariana  and
      Popel, Martin  and
      Post, Matt  and
      Rubino, Raphael  and
      Scarton, Carolina  and
      Specia, Lucia  and
      Turchi, Marco  and
      Verspoor, Karin  and
      Zampieri, Marcos
    },
    booktitle = {Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers},
    month = aug,
    year = {2016},
    address = {Berlin, Germany},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/W16-2301},
    doi = {10.18653/v1/W16-2301},
    pages = {131--198},
}

Author:

qanastek

Dataset size:

57.79 MB