Dataset:

iwslt2017

Tasks:

translation

Multilinguality:

translation

Size:

1M<n<10M

Language creators:

expert-generated

Annotation creators:

crowdsourced

Source datasets:

original

Dataset Card for IWSLT 2017

Dataset Summary

The IWSLT 2017 Multilingual Task addresses text translation, including zero-shot translation, with a single MT system across all directions among English, German, Dutch, Italian and Romanian. As an unofficial task, conventional bilingual text translation is offered between English and Arabic, French, Japanese, Chinese, German and Korean.
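
To make "all directions" concrete, the following sketch enumerates the ordered language pairs among the five languages of the multilingual task. The ISO 639-1 codes and the helper name `multilingual_directions` are illustrative assumptions, not part of the dataset's API, and the card does not state which of these directions are zero-shot.

```python
from itertools import permutations

# ISO 639-1 codes for the five languages named in the summary (assumed labels).
MULTILINGUAL_LANGS = ["en", "de", "nl", "it", "ro"]


def multilingual_directions(langs=MULTILINGUAL_LANGS):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(langs, 2))


directions = multilingual_directions()
print(len(directions))  # 5 sources * 4 targets each = 20 directions
```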

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

iwslt2017-ar-en
  • Size of downloaded dataset files: 27.75 MB
  • Size of the generated dataset: 58.74 MB
  • Total amount of disk used: 86.49 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"ar\": \"لقد طرت في \\\"القوات الجوية \\\" لمدة ثمان سنوات. والآن أجد نفسي مضطرا لخلع حذائي قبل صعود الطائرة!\", \"en\": \"I flew on Air ..."
}

iwslt2017-de-en
  • Size of downloaded dataset files: 16.76 MB
  • Size of the generated dataset: 44.43 MB
  • Total amount of disk used: 61.18 MB

An example of 'train' looks as follows.

{
    "translation": {
        "de": "Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür.",
        "en": "And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful."
    }
}

iwslt2017-en-ar
  • Size of downloaded dataset files: 29.33 MB
  • Size of the generated dataset: 58.74 MB
  • Total amount of disk used: 88.07 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"ar\": \"لقد طرت في \\\"القوات الجوية \\\" لمدة ثمان سنوات. والآن أجد نفسي مضطرا لخلع حذائي قبل صعود الطائرة!\", \"en\": \"I flew on Air ..."
}

iwslt2017-en-de
  • Size of downloaded dataset files: 16.76 MB
  • Size of the generated dataset: 44.43 MB
  • Total amount of disk used: 61.18 MB

An example of 'validation' looks as follows.

{
    "translation": {
        "de": "Die nächste Folie, die ich Ihnen zeige, ist eine Zeitrafferaufnahme was in den letzten 25 Jahren passiert ist.",
        "en": "The next slide I show you will be  a rapid fast-forward of what's happened over the last 25 years."
    }
}

iwslt2017-en-fr
  • Size of downloaded dataset files: 27.69 MB
  • Size of the generated dataset: 51.24 MB
  • Total amount of disk used: 78.94 MB

An example of 'validation' looks as follows.

{
    "translation": {
        "en": "But this understates the seriousness of this particular problem  because it doesn't show the thickness of the ice.",
        "fr": "Mais ceci tend à amoindrir le problème parce qu'on ne voit pas l'épaisseur de la glace."
    }
}
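
The uncropped examples above share one shape: a "translation" dictionary keyed by language code. The sketch below pulls a source/target pair out of such a record. It assumes config names follow the `iwslt2017-<first>-<second>` pattern seen above and treats the first code as the source language, which this card does not confirm; `parse_config` and `to_pair` are hypothetical helper names.

```python
def parse_config(name):
    """Split a config name such as 'iwslt2017-de-en' into its two language codes."""
    prefix, first, second = name.split("-")
    if prefix != "iwslt2017":
        raise ValueError(f"unexpected config name: {name!r}")
    return first, second


def to_pair(record, src, tgt):
    """Return (source_text, target_text) from a record shaped like the examples above."""
    translation = record["translation"]
    return translation[src], translation[tgt]


# Record copied from the iwslt2017-de-en 'train' example above.
record = {
    "translation": {
        "de": "Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür.",
        "en": "And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
    }
}

src, tgt = parse_config("iwslt2017-de-en")  # assumption: first code is the source
source_text, target_text = to_pair(record, src, tgt)
print(f"[{src}] {source_text}")
print(f"[{tgt}] {target_text}")
```

With the Hugging Face `datasets` library, records of this shape would typically come from a call such as `load_dataset("iwslt2017", "iwslt2017-de-en")`. The card itself does not show a loading call, and the exact arguments may depend on the library version.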

Data Fields

The data fields are the same among all splits.

iwslt2017-ar-en
  • translation: a multilingual string variable, with possible languages including ar, en.
iwslt2017-de-en
  • translation: a multilingual string variable, with possible languages including de, en.
iwslt2017-en-ar
  • translation: a multilingual string variable, with possible languages including en, ar.
iwslt2017-en-de
  • translation: a multilingual string variable, with possible languages including en, de.
iwslt2017-en-fr
  • translation: a multilingual string variable, with possible languages including en, fr.
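
The field description can be turned into a lightweight sanity check. This is a minimal sketch, assuming that "multilingual string variable" means a dictionary mapping language codes to strings, as in the examples under Data Instances. It treats the listed codes as the exact key set, which is stricter than "possible languages including"; `EXPECTED_LANGS` and `check_record` are hypothetical names.

```python
# Language codes per config, copied from the Data Fields list above.
EXPECTED_LANGS = {
    "iwslt2017-ar-en": {"ar", "en"},
    "iwslt2017-de-en": {"de", "en"},
    "iwslt2017-en-ar": {"en", "ar"},
    "iwslt2017-en-de": {"en", "de"},
    "iwslt2017-en-fr": {"en", "fr"},
}


def check_record(record, config):
    """Return True if record['translation'] maps exactly the expected codes to strings."""
    translation = record.get("translation")
    if not isinstance(translation, dict):
        return False
    if set(translation) != EXPECTED_LANGS[config]:
        return False
    return all(isinstance(text, str) for text in translation.values())


# Record copied from the iwslt2017-en-fr 'validation' example.
example = {
    "translation": {
        "en": "But this understates the seriousness of this particular problem  because it doesn't show the thickness of the ice.",
        "fr": "Mais ceci tend à amoindrir le problème parce qu'on ne voit pas l'épaisseur de la glace.",
    }
}

print(check_record(example, "iwslt2017-en-fr"))  # True
print(check_record(example, "iwslt2017-en-de"))  # False: keys are en/fr, not en/de
```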

Data Splits

name              train    validation   test
iwslt2017-ar-en   231713   888          8583
iwslt2017-de-en   206112   888          8079
iwslt2017-en-ar   231713   888          8583
iwslt2017-en-de   206112   888          8079
iwslt2017-en-fr   232825   890          8597
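
As a quick way to reason about the split counts listed above, the sketch below computes each config's total size and the share taken by each split. The dictionary `SPLITS` and the helper `split_shares` are hypothetical names; the numbers are copied from this card, not read from the dataset itself.

```python
# Example counts per split, copied from the Data Splits listing above.
SPLITS = {
    "iwslt2017-ar-en": {"train": 231713, "validation": 888, "test": 8583},
    "iwslt2017-de-en": {"train": 206112, "validation": 888, "test": 8079},
    "iwslt2017-en-ar": {"train": 231713, "validation": 888, "test": 8583},
    "iwslt2017-en-de": {"train": 206112, "validation": 888, "test": 8079},
    "iwslt2017-en-fr": {"train": 232825, "validation": 890, "test": 8597},
}


def split_shares(counts):
    """Return each split's fraction of the config's total example count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    total = sum(counts.values())
    shares = split_shares(counts)
    print(f"{config}: total={total}, train={shares['train']:.1%}, test={shares['test']:.1%}")
```

Configs that list the same language pair in both orders (for example ar-en and en-ar) report identical counts.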

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Creative Commons BY-NC-ND

See the TED Talks Usage Policy (https://www.ted.com/about/our-organization/our-policies-terms/ted-talks-usage-policy).

Citation Information

@inproceedings{cettolo-etal-2017-overview,
    title = "Overview of the {IWSLT} 2017 Evaluation Campaign",
    author = {Cettolo, Mauro  and
      Federico, Marcello  and
      Bentivogli, Luisa  and
      Niehues, Jan  and
      St{\"u}ker, Sebastian  and
      Sudoh, Katsuhito  and
      Yoshino, Koichiro  and
      Federmann, Christian},
    booktitle = "Proceedings of the 14th International Conference on Spoken Language Translation",
    month = dec # " 14-15",
    year = "2017",
    address = "Tokyo, Japan",
    publisher = "International Workshop on Spoken Language Translation",
    url = "https://aclanthology.org/2017.iwslt-1.1",
    pages = "2--14",
}

Contributions

Thanks to @thomwolf and @Narsil for adding this dataset.