Dataset: turkic_xwmt
To establish a comprehensive and challenging evaluation benchmark for machine translation in Turkic languages, we translate a test set originally introduced in the WMT 2020 News Translation Task for English-Russian. The original dataset is professionally translated and consists of sentences from news articles that are both English- and Russian-centric. We adopt this evaluation set (X-WMT) and begin efforts to translate it into several Turkic languages. The current version of X-WMT covers 8 Turkic languages and 88 language directions, with a minimum of 300 sentences per language direction.
[More Information Needed]
Currently covered languages are (besides English and Russian):
A random example from the Russian-Uzbek set:
{"translation": {"ru": "Моника Мутсвангва , министр информации Зимбабве , утверждает , что полиция вмешалась в отъезд Магомбейи из соображений безопасности и вследствие состояния его здоровья .", "uz": "Zimbabvening Axborot vaziri , Monika Mutsvanva Magombeyining xavfsizligi va sog'ligi tufayli bo'lgan jo'nab ketishinida politsiya aralashuvini ushlab turadi ."}}
Each example has one field, "translation", that contains two subfields, one per language. For the Russian-Uzbek set these are "ru" and "uz".
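As a minimal sketch of how a record with this layout can be read in Python: the record below is a hand-written stand-in with shortened sentences, and `get_pair` is a hypothetical helper introduced only for illustration. Neither is part of the dataset or of any library.

```python
from typing import Dict, Tuple

# Hypothetical record mirroring the layout described above: a single
# "translation" field holding one subfield per language code.
# The sentences are shortened stand-ins, not full dataset entries.
example: Dict[str, Dict[str, str]] = {
    "translation": {
        "ru": "Моника Мутсвангва , министр информации Зимбабве",
        "uz": "Zimbabvening Axborot vaziri , Monika Mutsvanva",
    }
}


def get_pair(record: Dict[str, Dict[str, str]], src: str, tgt: str) -> Tuple[str, str]:
    """Return (source sentence, target sentence) from one record.

    Hypothetical helper: raises KeyError if either language code is absent.
    """
    pair = record["translation"]
    missing = [code for code in (src, tgt) if code not in pair]
    if missing:
        raise KeyError(f"record has no subfield(s) {missing}; found {sorted(pair)}")
    return pair[src], pair[tgt]


source, target = get_pair(example, "ru", "uz")
print(source)
print(target)
```

Because both sentences sit under the same "translation" key, swapping the `src` and `tgt` arguments yields the reverse direction from the same record.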
Who are the source language producers?
[More Information Needed]
Who are the annotators?
Translators, annotators and dataset contributors (in alphabetical order)
Abilxayr Zholdybai, Aigiz Kunafin, Akylbek Khamitov, Alperen Cantez, Aydos Muxammadiyarov, Doniyorbek Rafikjonov, Erkinbek Vokhabov, Ipek Baris, Iskander Shakirov, Madina Zokirjonova, Mohiyaxon Uzoqova, Mukhammadbektosh Khaydarov, Nurlan Maharramli, Petr Popov, Rasul Karimov, Sariya Kagarmanova, Ziyodabonu Qobiljon qizi
@inproceedings{mirzakhalov2021large,
  title={A Large-Scale Study of Machine Translation in Turkic Languages},
  author={Mirzakhalov, Jamshidbek and Babu, Anoop and Ataman, Duygu and Kariev, Sherzod and Tyers, Francis and Abduraufov, Otabek and Hajili, Mammad and Ivanova, Sardana and Khaytbaev, Abror and Laverghetta Jr, Antonio and others},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={5876--5890},
  year={2021}
}
This project was carried out with the help and contributions from dozens of individuals and organizations. We acknowledge and greatly appreciate each and every one of them:
Authors on the publications (in alphabetical order)
Abror Khaytbaev, Ahsan Wahab, Aigiz Kunafin, Anoop Babu, Antonio Laverghetta Jr., Behzodbek Moydinboyev, Dr. Duygu Ataman, Esra Onal, Dr. Francis Tyers, Jamshidbek Mirzakhalov, Dr. John Licato, Dr. Julia Kreutzer, Mammad Hajili, Mokhiyakhon Uzokova, Dr. Orhan Firat, Otabek Abduraufov, Sardana Ivanova, Shaxnoza Pulatova, Sherzod Kariev, Dr. Sriram Chellappan
Translators, annotators and dataset contributors (in alphabetical order)
Abilxayr Zholdybai, Aigiz Kunafin, Akylbek Khamitov, Alperen Cantez, Aydos Muxammadiyarov, Doniyorbek Rafikjonov, Erkinbek Vokhabov, Ipek Baris, Iskander Shakirov, Madina Zokirjonova, Mohiyaxon Uzoqova, Mukhammadbektosh Khaydarov, Nurlan Maharramli, Petr Popov, Rasul Karimov, Sariya Kagarmanova, Ziyodabonu Qobiljon qizi
Industry supporters
Google Cloud, Khan Academy Oʻzbek, The Foundation for the Preservation and Development of the Bashkir Language
Thanks to @mirzakhalov for adding this dataset.