Dataset: ted_hrlr
Tasks: Translation
Multilinguality: translation
Size: 1M<n<10M
Language creators: expert-generated
Annotation creators: crowdsourced
Source datasets: extended|ted_talks_iwslt
License: cc-by-nc-nd-4.0

These datasets are derived from TED talk transcripts and are intended for comparing similar language pairs in which one language is high-resource and the other is low-resource.
az_to_en
An example of 'train' looks as follows.
{ "translation": { "az": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .", "en": "please raise your hand if something applies to you ." } }

aztr_to_en
An example of 'train' looks as follows.
{ "translation": { "az_tr": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .", "en": "please raise your hand if something applies to you ." } }

be_to_en
An example of 'train' looks as follows.
{ "translation": { "be": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .", "en": "please raise your hand if something applies to you ." } }

beru_to_en
An example of 'validation' looks as follows.
This example was too long and was cropped:
{ "translation": "{\"be_ru\": \"11 yaşımdaydım . səhərin birində , evimizdəki sevinc səslərinə oyandığım indiki kimi yadımdadır .\", \"en\": \"when i was..." }

es_to_pt
An example of 'validation' looks as follows.
This example was too long and was cropped:
{ "translation": "{\"es\": \"11 yaşımdaydım . səhərin birində , evimizdəki sevinc səslərinə oyandığım indiki kimi yadımdadır .\", \"pt\": \"when i was 11..." }
The data fields are the same among all splits.
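As a minimal sketch of working with this record layout, the snippet below pulls the source and target sentences out of one record. It assumes records are already available as plain Python dicts shaped like the examples above. The helper name `split_pair` and the `CONFIG_LANGS` mapping are hypothetical; the language keys in the mapping are copied from the examples shown for each configuration. Handling a JSON-string `translation` value is also an assumption, included only because the cropped examples display the field that way.

```python
import json

# Language keys per configuration, copied from the example records.
# (Hypothetical helper mapping, not part of the dataset itself.)
CONFIG_LANGS = {
    "az_to_en": ("az", "en"),
    "aztr_to_en": ("az_tr", "en"),
    "be_to_en": ("be", "en"),
    "beru_to_en": ("be_ru", "en"),
    "es_to_pt": ("es", "pt"),
}


def split_pair(record, config):
    """Return (source_text, target_text) for one record of `config`."""
    src_key, tgt_key = CONFIG_LANGS[config]
    translation = record["translation"]
    # Assumption: the inner mapping may arrive serialized as a JSON string,
    # as in the cropped examples; decode it before indexing.
    if isinstance(translation, str):
        translation = json.loads(translation)
    return translation[src_key], translation[tgt_key]


record = {
    "translation": {
        "az": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .",
        "en": "please raise your hand if something applies to you .",
    }
}

source, target = split_pair(record, "az_to_en")
print(source)
print(target)
```

Keeping the key lookup in one table means the same function works for every configuration, including those whose key (`az_tr`, `be_ru`) differs from the configuration name.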
name | train | validation | test |
---|---|---|---|
az_to_en | 5947 | 672 | 904 |
aztr_to_en | 188397 | 672 | 904 |
be_to_en | 4510 | 249 | 665 |
beru_to_en | 212615 | 249 | 665 |
es_to_pt | 44939 | 1017 | 1764 |
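To make the size gap between configurations concrete, the following sketch recomputes per-configuration totals and training-set ratios from the split counts in the table. The names `SPLITS`, `total_examples`, and `train_ratio` are hypothetical. Pairing `aztr_to_en` with `az_to_en` (and `beru_to_en` with `be_to_en`) as the larger and smaller counterparts is an assumption based on the configuration names.

```python
# Split sizes copied from the table above.
SPLITS = {
    "az_to_en": {"train": 5947, "validation": 672, "test": 904},
    "aztr_to_en": {"train": 188397, "validation": 672, "test": 904},
    "be_to_en": {"train": 4510, "validation": 249, "test": 665},
    "beru_to_en": {"train": 212615, "validation": 249, "test": 665},
    "es_to_pt": {"train": 44939, "validation": 1017, "test": 1764},
}


def total_examples(config):
    """Sum train, validation, and test counts for one configuration."""
    return sum(SPLITS[config].values())


def train_ratio(larger, smaller):
    """How many times more training examples `larger` has than `smaller`."""
    return SPLITS[larger]["train"] / SPLITS[smaller]["train"]


for name in SPLITS:
    print(f"{name}: {total_examples(name)} examples")

# Assumed pairings: each combined configuration vs. its low-resource counterpart.
print(f"aztr_to_en / az_to_en: {train_ratio('aztr_to_en', 'az_to_en'):.1f}x")
print(f"beru_to_en / be_to_en: {train_ratio('beru_to_en', 'be_to_en'):.1f}x")
```

Within each assumed pairing the validation and test counts are identical, so only the amount of training data differs.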
@inproceedings{qi-etal-2018-pre,
    title = "When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?",
    author = "Qi, Ye and Sachan, Devendra and Felix, Matthieu and Padmanabhan, Sarguna and Neubig, Graham",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans, Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-2084",
    doi = "10.18653/v1/N18-2084",
    pages = "529--535",
}
Thanks to @thomwolf, @lewtun, @patrickvonplaten for adding this dataset.