Dataset:
qanastek/WMT-16-PubMed
WMT-16-PubMed is a parallel corpus for neural machine translation, collected and aligned for ACL 2016 during the WMT'16 Shared Task: Biomedical Translation Task.
translation: The dataset can be used to train a model for translation.
The corpus consists of source and target sentence pairs covering 4 languages:
List of languages: English (en), Spanish (es), French (fr), Portuguese (pt).
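from datasets import load_dataset

# 'force_redownload' bypasses any cached copy and fetches the data again.
dataset = load_dataset("qanastek/WMT-16-PubMed", split='train', download_mode='force_redownload')

print(dataset)     # summary of the split (features and number of rows)
print(dataset[0])  # first example as a dict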
        lang   doc_id    workshop                                     publisher  source_text                                        target_text
0       en-fr  26839447  WMT'16 Biomedical Translation Task - PubMed  pubmed     Global Health: Where Do Physiotherapy and Reha...  La place des cheveux et des poils dans les rit...
1       en-fr  26837117  WMT'16 Biomedical Translation Task - PubMed  pubmed     Carabin                                            Les Carabins
2       en-fr  26837116  WMT'16 Biomedical Translation Task - PubMed  pubmed     In Process Citation                                Le laboratoire d'Anatomie, Biomécanique et Org...
3       en-fr  26837115  WMT'16 Biomedical Translation Task - PubMed  pubmed     Comment on the misappropriation of bibliograph...  Du détournement des références bibliographique...
4       en-fr  26837114  WMT'16 Biomedical Translation Task - PubMed  pubmed     Anti-aging medicine, a science-based, essentia...  La médecine anti-âge, une médecine scientifiqu...
...     ...    ...       ...                                          ...        ...                                                ...
973972  en-pt  20274330  WMT'16 Biomedical Translation Task - PubMed  pubmed     Myocardial infarction, diagnosis and treatment     Infarto do miocárdio; diagnóstico e tratamento
973973  en-pt  20274329  WMT'16 Biomedical Translation Task - PubMed  pubmed     The health areas politics                          A política dos campos de saúde
973974  en-pt  20274328  WMT'16 Biomedical Translation Task - PubMed  pubmed     The role in tissue edema and liquid exchanges ...  O papel dos tecidos nos edemas e nas trocas lí...
973975  en-pt  20274327  WMT'16 Biomedical Translation Task - PubMed  pubmed     About suppuration of the wound after thoracopl...  Sôbre as supurações da ferida operatória após ...
973976  en-pt  20274326  WMT'16 Biomedical Translation Task - PubMed  pubmed     Experimental study of liver lesions in the tre...  Estudo experimental das lesões hepáticas no tr...
lang: The source and target language pair (for example en-fr), of type String.
source_text: The source text, of type String.
target_text: The target text, of type String.
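As a minimal sketch of how these three fields fit together, the snippet below splits the lang value into its two language codes and reshapes a row into a dictionary keyed by language code. It assumes that lang always has the form source-target, as in the preview rows. The helper names split_pair, to_translation and select_pair are hypothetical and not part of the dataset or of the datasets library, and the sample rows reuse texts from the preview rather than loading the corpus.

```python
from typing import Dict, List, Tuple

# Two rows copied from the preview above, reduced to the three documented fields.
SAMPLE_ROWS: List[Dict[str, str]] = [
    {"lang": "en-fr", "source_text": "Carabin", "target_text": "Les Carabins"},
    {
        "lang": "en-pt",
        "source_text": "The health areas politics",
        "target_text": "A política dos campos de saúde",
    },
]


def split_pair(lang: str) -> Tuple[str, str]:
    """Split a pair such as 'en-fr' into (source, target).

    Assumption: the value is always '<source>-<target>'.
    """
    source, target = lang.split("-", 1)
    return source, target


def to_translation(row: Dict[str, str]) -> Dict[str, str]:
    """Reshape a row into {source_code: source_text, target_code: target_text}."""
    source, target = split_pair(row["lang"])
    return {source: row["source_text"], target: row["target_text"]}


def select_pair(rows: List[Dict[str, str]], lang: str) -> List[Dict[str, str]]:
    """Keep only the rows belonging to one language pair."""
    return [row for row in rows if row["lang"] == lang]


if __name__ == "__main__":
    for row in select_pair(SAMPLE_ROWS, "en-fr"):
        print(to_translation(row))
```

The same predicate (row["lang"] == "en-fr") could be applied to the loaded split instead of the in-memory sample; that variant is not shown here because it requires downloading the corpus.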
en-es : 285,584
en-fr : 614,093
en-pt : 74,300
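A quick arithmetic check on the counts above, as a small sketch: the three pairs add up to 973,977 sentence pairs, which agrees with the last row index (973976) in the preview. The variable names are illustrative only.

```python
# Sentence-pair counts per language pair, copied from the list above.
PAIR_COUNTS = {"en-es": 285_584, "en-fr": 614_093, "en-pt": 74_300}

# Total size of the corpus across all pairs.
total = sum(PAIR_COUNTS.values())

# Fraction of the corpus contributed by each pair.
shares = {pair: count / total for pair, count in PAIR_COUNTS.items()}

print(f"total: {total:,}")
for pair, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{pair}: {share:.1%}")
```

The split is uneven: en-fr makes up well over half of the corpus and en-pt under a tenth, which may be worth keeping in mind when sampling training batches across pairs.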
For details, check the corresponding pages.
The shared task has been organized by:
The corpus is free of personal or sensitive information.
The nature of the task introduces variability in the quality of the target translations.
Hugging Face WMT-16-PubMed : Labrak Yanis, Dufour Richard (Not affiliated with the original corpus)
WMT'16 Shared Task: Biomedical Translation Task :
Please cite the following paper when using this dataset.
@inproceedings{bojar-etal-2016-findings,
    title = {Findings of the 2016 Conference on Machine Translation},
    author = {Bojar, Ondrej and
      Chatterjee, Rajen and
      Federmann, Christian and
      Graham, Yvette and
      Haddow, Barry and
      Huck, Matthias and
      Jimeno Yepes, Antonio and
      Koehn, Philipp and
      Logacheva, Varvara and
      Monz, Christof and
      Negri, Matteo and
      Neveol, Aurelie and
      Neves, Mariana and
      Popel, Martin and
      Post, Matt and
      Rubino, Raphael and
      Scarton, Carolina and
      Specia, Lucia and
      Turchi, Marco and
      Verspoor, Karin and
      Zampieri, Marcos},
    booktitle = {Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers},
    month = aug,
    year = {2016},
    address = {Berlin, Germany},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/W16-2301},
    doi = {10.18653/v1/W16-2301},
    pages = {131--198},
}