Dataset: ted_hrlr
Tasks:
Multilinguality: translation
Size: 1M<n<10M
Language creators: expert-generated
Annotation creators: crowdsourced
Source datasets: extended|ted_talks_iwslt
License:
Data sets derived from TED talk transcripts for comparing similar language pairs where one is high resource and the other is low resource.
az_to_en
An example of 'train' looks as follows.
{
"translation": {
"az": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .",
"en": "please raise your hand if something applies to you ."
}
}
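A record of this shape can be unpacked into a (source, target) sentence pair. The following is a minimal Python sketch rather than part of the dataset's own tooling: the helper name `to_pair` is hypothetical, and the record is copied from the example above.

```python
from typing import Dict, Tuple

# Record copied from the 'train' example above.
record = {
    "translation": {
        "az": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .",
        "en": "please raise your hand if something applies to you .",
    }
}


def to_pair(example: Dict[str, Dict[str, str]], src: str, tgt: str) -> Tuple[str, str]:
    """Return (source sentence, target sentence) from one record.

    `to_pair` is a hypothetical helper name used only for illustration.
    """
    translation = example["translation"]
    return translation[src], translation[tgt]


source, target = to_pair(record, "az", "en")
# In this example punctuation is already separated by spaces,
# so a plain whitespace split yields word-level tokens.
print(source.split())
print(target.split())
```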
aztr_to_en
An example of 'train' looks as follows.
{
"translation": {
"az_tr": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .",
"en": "please raise your hand if something applies to you ."
}
}
be_to_en
An example of 'train' looks as follows.
{
"translation": {
"be": "zəhmət olmasa , sizə xitab edən sözlər eşidəndə əlinizi qaldırın .",
"en": "please raise your hand if something applies to you ."
}
}
beru_to_en
An example of 'validation' looks as follows.
This example was too long and was cropped:
{
"translation": "{\"be_ru\": \"11 yaşımdaydım . səhərin birində , evimizdəki sevinc səslərinə oyandığım indiki kimi yadımdadır .\", \"en\": \"when i was..."
}
es_to_pt
An example of 'validation' looks as follows.
This example was too long and was cropped:
{
"translation": "{\"es\": \"11 yaşımdaydım . səhərin birində , evimizdəki sevinc səslərinə oyandığım indiki kimi yadımdadır .\", \"pt\": \"when i was 11..."
}
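The examples above suggest a naming pattern: the keys inside `translation` follow the configuration name, and a combined source such as `aztr` appears as `az_tr`. The sketch below encodes that pattern in Python. It is an assumption inferred from these five examples only, and `translation_keys` is a hypothetical helper, not a function provided with the dataset.

```python
from typing import Tuple


def translation_keys(config: str) -> Tuple[str, str]:
    """Map a configuration name to the (source, target) keys seen in the examples."""
    source, target = config.split("_to_")
    # Assumption: a four-letter source such as "aztr" joins two two-letter
    # codes, and the record key separates them with an underscore ("az_tr").
    if len(source) == 4:
        source = f"{source[:2]}_{source[2:]}"
    return source, target


for name in ["az_to_en", "aztr_to_en", "be_to_en", "beru_to_en", "es_to_pt"]:
    print(name, translation_keys(name))
```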
The data fields are the same among all splits.
| name | train | validation | test |
|---|---|---|---|
| az_to_en | 5947 | 672 | 904 |
| aztr_to_en | 188397 | 672 | 904 |
| be_to_en | 4510 | 249 | 665 |
| beru_to_en | 212615 | 249 | 665 |
| es_to_pt | 44939 | 1017 | 1764 |
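In the table, the combined configurations (`aztr_to_en`, `beru_to_en`) have the same validation and test sizes as their single-language counterparts, and only the training split grows. The following Python sketch works out how much larger the training splits are. The numbers are copied from the table, and `SPLITS` and `train_ratio` are illustrative names.

```python
# Split sizes copied from the table above: (train, validation, test).
SPLITS = {
    "az_to_en": (5947, 672, 904),
    "aztr_to_en": (188397, 672, 904),
    "be_to_en": (4510, 249, 665),
    "beru_to_en": (212615, 249, 665),
    "es_to_pt": (44939, 1017, 1764),
}


def train_ratio(combined: str, base: str) -> float:
    """Ratio of training-set sizes between two configurations."""
    return SPLITS[combined][0] / SPLITS[base][0]


for combined, base in [("aztr_to_en", "az_to_en"), ("beru_to_en", "be_to_en")]:
    # Validation and test sizes match, so only the training data differs in size.
    assert SPLITS[combined][1:] == SPLITS[base][1:]
    print(f"{combined}: {train_ratio(combined, base):.1f}x the training pairs of {base}")
```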
@inproceedings{qi-etal-2018-pre,
    title = "When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?",
    author = "Qi, Ye and
      Sachan, Devendra and
      Felix, Matthieu and
      Padmanabhan, Sarguna and
      Neubig, Graham",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans, Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-2084",
    doi = "10.18653/v1/N18-2084",
    pages = "529--535",
}
Thanks to @thomwolf, @lewtun, and @patrickvonplaten for adding this dataset.