数据集:
gsarti/wmt_vat
This dataset comprises 70 small and discriminative test sets for machine translation (MT) evaluation called variance-aware test sets (VAT), covering 35 translation directions from WMT16 to WMT20 competitions. VAT is automatically created by a novel variance-aware filtering method that filters the indiscriminative test instances of the current MT benchmark without any human labor. Experimental results show that VAT outperforms the original WMT benchmark in terms of the correlation with human judgment across mainstream language pairs and test sets. Further analysis on the properties of VAT reveals the challenging linguistic features (e.g., translation of low-frequency words and proper nouns) for the competitive MT systems, providing guidance for constructing future MT test sets.
Disclaimer : *The VAT test sets are hosted through Github by the Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory (NLP2CT Lab) of the University of Macau. They were introduced by the paper Variance-Aware Machine Translation Test Sets by Runzhe Zhan , Xuebo Liu , Derek F. Wong , Lidia S. Chao and follow the original licensing for WMT test sets.
Refer to the original paper for additional details on model evaluation on VAT.
The following table taken from the original paper lists the languages supported by the VAT test sets, for a total of 70 language pairs:
↔️ | wmt16 | wmt17 | wmt18 | wmt19 | wmt20 |
---|---|---|---|---|---|
xx_en | cs , de , fi , ro , ru , tr | cs , de , fi , lv , ru , tr , zh | cs , de , et , fi , ru , tr , zh | de , fi , gu , kk , lt , ru , zh | cs , de , iu , ja , km , pl , ps , ru , ta , zh |
en_xx | ru | cs , de , fi , lv , ru , tr , zh | cs , de , et , fi , ru , tr , zh | cs , de , fi , gu , kk , lt , ru , zh | cs , de , ja , pl , ru , ta , zh |
xx_yy | / | / | / | de_cs , de_fr , fr_de | / |
To use any one of the test set, pass wmtXX_src_tgt as configuration name to the load_dataset command. E.g. to load the English-Russian test set from wmt16 , use load_dataset('gsarti/wmt_vat', 'wmt16_en_ru') .
A sample from the test split (the only available split) for the WMT16 English-Russian language ( wmt16_en_ru config) is provided below. All configurations have the same structure.
{ 'orig_id': 0, 'source': 'The social card of residents of Ivanovo region is to be recognised as an electronic payment instrument.', 'reference': 'Социальная карта жителя Ивановской области признается электронным средством платежа.' }
The text is provided as-in the original dataset, without further preprocessing or tokenization.
Taken from the original repository:
Configuration | # Sentences | # Words | # Vocabulary |
---|---|---|---|
wmt20_km_en | 928 | 17170 | 3645 |
wmt20_cs_en | 266 | 12568 | 3502 |
wmt20_en_de | 567 | 21336 | 5945 |
wmt20_ja_en | 397 | 10526 | 3063 |
wmt20_ps_en | 1088 | 20296 | 4303 |
wmt20_en_zh | 567 | 18224 | 5019 |
wmt20_en_ta | 400 | 7809 | 4028 |
wmt20_de_en | 314 | 16083 | 4046 |
wmt20_zh_en | 800 | 35132 | 6457 |
wmt20_en_ja | 400 | 12718 | 2969 |
wmt20_en_cs | 567 | 16579 | 6391 |
wmt20_en_pl | 400 | 8423 | 3834 |
wmt20_en_ru | 801 | 17446 | 6877 |
wmt20_pl_en | 400 | 7394 | 2399 |
wmt20_iu_en | 1188 | 23494 | 3876 |
wmt20_ru_en | 396 | 6966 | 2330 |
wmt20_ta_en | 399 | 7427 | 2148 |
wmt19_zh_en | 800 | 36739 | 6168 |
wmt19_en_cs | 799 | 15433 | 6111 |
wmt19_de_en | 800 | 15219 | 4222 |
wmt19_en_gu | 399 | 8494 | 3548 |
wmt19_fr_de | 680 | 12616 | 3698 |
wmt19_en_zh | 799 | 20230 | 5547 |
wmt19_fi_en | 798 | 13759 | 3555 |
wmt19_en_fi | 799 | 13303 | 6149 |
wmt19_kk_en | 400 | 9283 | 2584 |
wmt19_de_cs | 799 | 15080 | 6166 |
wmt19_lt_en | 400 | 10474 | 2874 |
wmt19_en_lt | 399 | 7251 | 3364 |
wmt19_ru_en | 800 | 14693 | 3817 |
wmt19_en_kk | 399 | 6411 | 3252 |
wmt19_en_ru | 799 | 16393 | 6125 |
wmt19_gu_en | 406 | 8061 | 2434 |
wmt19_de_fr | 680 | 16181 | 3517 |
wmt19_en_de | 799 | 18946 | 5340 |
wmt18_en_cs | 1193 | 19552 | 7926 |
wmt18_cs_en | 1193 | 23439 | 5453 |
wmt18_en_fi | 1200 | 16239 | 7696 |
wmt18_en_tr | 1200 | 19621 | 8613 |
wmt18_en_et | 800 | 13034 | 6001 |
wmt18_ru_en | 1200 | 26747 | 6045 |
wmt18_et_en | 800 | 20045 | 5045 |
wmt18_tr_en | 1200 | 25689 | 5955 |
wmt18_fi_en | 1200 | 24912 | 5834 |
wmt18_zh_en | 1592 | 42983 | 7985 |
wmt18_en_zh | 1592 | 34796 | 8579 |
wmt18_en_ru | 1200 | 22830 | 8679 |
wmt18_de_en | 1199 | 28275 | 6487 |
wmt18_en_de | 1199 | 25473 | 7130 |
wmt17_en_lv | 800 | 14453 | 6161 |
wmt17_zh_en | 800 | 20590 | 5149 |
wmt17_en_tr | 1203 | 17612 | 7714 |
wmt17_lv_en | 800 | 18653 | 4747 |
wmt17_en_de | 1202 | 22055 | 6463 |
wmt17_ru_en | 1200 | 24807 | 5790 |
wmt17_en_fi | 1201 | 17284 | 7763 |
wmt17_tr_en | 1203 | 23037 | 5387 |
wmt17_en_zh | 800 | 18001 | 5629 |
wmt17_en_ru | 1200 | 22251 | 8761 |
wmt17_fi_en | 1201 | 23791 | 5300 |
wmt17_en_cs | 1202 | 21278 | 8256 |
wmt17_de_en | 1202 | 23838 | 5487 |
wmt17_cs_en | 1202 | 22707 | 5310 |
wmt16_tr_en | 1200 | 19225 | 4823 |
wmt16_ru_en | 1199 | 23010 | 5442 |
wmt16_ro_en | 800 | 16200 | 3968 |
wmt16_de_en | 1200 | 22612 | 5511 |
wmt16_en_ru | 1199 | 20233 | 7872 |
wmt16_fi_en | 1200 | 20744 | 5176 |
wmt16_cs_en | 1200 | 23235 | 5324 |
The dataset was created by retaining a subset of the top 40% instances from various WMT test sets for which the variance between automatic scores (BLEU, BLEURT, COMET, BERTScore) was the highest. Please refer to the original article Variance-Aware Machine Translation Test Sets for additional information on dataset creation.
The original authors of VAT are the curators of the original dataset. For problems or updates on this ? Datasets version, please contact gabriele.sarti996@gmail.com .
The variance-aware test set were created based on the original WMT test set. Thus, the the original data licensing plan already stated by WMT organizers is still applicable:
The data released for the WMT news translation task can be freely used for research purposes, we just ask that you cite the WMT shared task overview paper, and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult with original owners of the data sets.
Please cite the authors if you use these corpora in your work. It is also advised to cite the original WMT shared task paper for the specific test sets that were used.
@inproceedings{ zhan2021varianceaware, title={Variance-Aware Machine Translation Test Sets}, author={Runzhe Zhan and Xuebo Liu and Derek F. Wong and Lidia S. Chao}, booktitle={Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track}, year={2021}, url={https://openreview.net/forum?id=hhKA5k0oVy5} }