数据集:

gsarti/wmt_vat

任务:

文本生成

翻译

语言:

计算机处理:

multilingual translation

大小:

size_categories:unknown

语言创建人:

expert-generated

批注创建人:

found

源数据集:

extended|wmt16 extended|wmt17 extended|wmt18

其他:

conditional-text-generation

许可:

license:unknown

数据集介绍文件清单

中文

Dataset Card for Variance-Aware MT Test Sets

Dataset Summary

This dataset comprises 70 small and discriminative test sets for machine translation (MT) evaluation called variance-aware test sets (VAT), covering 35 translation directions from WMT16 to WMT20 competitions. VAT is automatically created by a novel variance-aware filtering method that filters the indiscriminative test instances of the current MT benchmark without any human labor. Experimental results show that VAT outperforms the original WMT benchmark in terms of the correlation with human judgment across mainstream language pairs and test sets. Further analysis on the properties of VAT reveals the challenging linguistic features (e.g., translation of low-frequency words and proper nouns) for the competitive MT systems, providing guidance for constructing future MT test sets.

Disclaimer : *The VAT test sets are hosted through Github by the Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory (NLP2CT Lab) of the University of Macau. They were introduced by the paper Variance-Aware Machine Translation Test Sets by Runzhe Zhan , Xuebo Liu , Derek F. Wong , Lidia S. Chao and follow the original licensing for WMT test sets.

Supported Tasks and Leaderboards

Machine Translation

Refer to the original paper for additional details on model evaluation on VAT.

Languages

The following table taken from the original paper lists the languages supported by the VAT test sets, for a total of 70 language pairs:

↔️	wmt16	wmt17	wmt18	wmt19	wmt20
xx_en	cs , de , fi , ro , ru , tr	cs , de , fi , lv , ru , tr , zh	cs , de , et , fi , ru , tr , zh	de , fi , gu , kk , lt , ru , zh	cs , de , iu , ja , km , pl , ps , ru , ta , zh
en_xx	ru	cs , de , fi , lv , ru , tr , zh	cs , de , et , fi , ru , tr , zh	cs , de , fi , gu , kk , lt , ru , zh	cs , de , ja , pl , ru , ta , zh
xx_yy	/	/	/	de_cs , de_fr , fr_de	/

To use any one of the test set, pass wmtXX_src_tgt as configuration name to the load_dataset command. E.g. to load the English-Russian test set from wmt16 , use load_dataset('gsarti/wmt_vat', 'wmt16_en_ru') .

Dataset Structure

Data Instances

A sample from the test split (the only available split) for the WMT16 English-Russian language ( wmt16_en_ru config) is provided below. All configurations have the same structure.

{
    'orig_id': 0,
    'source': 'The social card of residents of Ivanovo region is to be recognised as an electronic payment instrument.',
    'reference': 'Социальная карта жителя Ивановской области признается электронным средством платежа.'
}

The text is provided as-in the original dataset, without further preprocessing or tokenization.

Data Fields

orig_id : Id corresponding to the row id in the original dataset, before variance-aware filtering.
source : The source sentence.
reference : The reference sentence in the target language.

Data Splits

Taken from the original repository:

Configuration	# Sentences	# Words	# Vocabulary
wmt20_km_en	928	17170	3645
wmt20_cs_en	266	12568	3502
wmt20_en_de	567	21336	5945
wmt20_ja_en	397	10526	3063
wmt20_ps_en	1088	20296	4303
wmt20_en_zh	567	18224	5019
wmt20_en_ta	400	7809	4028
wmt20_de_en	314	16083	4046
wmt20_zh_en	800	35132	6457
wmt20_en_ja	400	12718	2969
wmt20_en_cs	567	16579	6391
wmt20_en_pl	400	8423	3834
wmt20_en_ru	801	17446	6877
wmt20_pl_en	400	7394	2399
wmt20_iu_en	1188	23494	3876
wmt20_ru_en	396	6966	2330
wmt20_ta_en	399	7427	2148
wmt19_zh_en	800	36739	6168
wmt19_en_cs	799	15433	6111
wmt19_de_en	800	15219	4222
wmt19_en_gu	399	8494	3548
wmt19_fr_de	680	12616	3698
wmt19_en_zh	799	20230	5547
wmt19_fi_en	798	13759	3555
wmt19_en_fi	799	13303	6149
wmt19_kk_en	400	9283	2584
wmt19_de_cs	799	15080	6166
wmt19_lt_en	400	10474	2874
wmt19_en_lt	399	7251	3364
wmt19_ru_en	800	14693	3817
wmt19_en_kk	399	6411	3252
wmt19_en_ru	799	16393	6125
wmt19_gu_en	406	8061	2434
wmt19_de_fr	680	16181	3517
wmt19_en_de	799	18946	5340
wmt18_en_cs	1193	19552	7926
wmt18_cs_en	1193	23439	5453
wmt18_en_fi	1200	16239	7696
wmt18_en_tr	1200	19621	8613
wmt18_en_et	800	13034	6001
wmt18_ru_en	1200	26747	6045
wmt18_et_en	800	20045	5045
wmt18_tr_en	1200	25689	5955
wmt18_fi_en	1200	24912	5834
wmt18_zh_en	1592	42983	7985
wmt18_en_zh	1592	34796	8579
wmt18_en_ru	1200	22830	8679
wmt18_de_en	1199	28275	6487
wmt18_en_de	1199	25473	7130
wmt17_en_lv	800	14453	6161
wmt17_zh_en	800	20590	5149
wmt17_en_tr	1203	17612	7714
wmt17_lv_en	800	18653	4747
wmt17_en_de	1202	22055	6463
wmt17_ru_en	1200	24807	5790
wmt17_en_fi	1201	17284	7763
wmt17_tr_en	1203	23037	5387
wmt17_en_zh	800	18001	5629
wmt17_en_ru	1200	22251	8761
wmt17_fi_en	1201	23791	5300
wmt17_en_cs	1202	21278	8256
wmt17_de_en	1202	23838	5487
wmt17_cs_en	1202	22707	5310
wmt16_tr_en	1200	19225	4823
wmt16_ru_en	1199	23010	5442
wmt16_ro_en	800	16200	3968
wmt16_de_en	1200	22612	5511
wmt16_en_ru	1199	20233	7872
wmt16_fi_en	1200	20744	5176
wmt16_cs_en	1200	23235	5324

Dataset Creation

The dataset was created by retaining a subset of the top 40% instances from various WMT test sets for which the variance between automatic scores (BLEU, BLEURT, COMET, BERTScore) was the highest. Please refer to the original article Variance-Aware Machine Translation Test Sets for additional information on dataset creation.

Additional Information

Dataset Curators

The original authors of VAT are the curators of the original dataset. For problems or updates on this 🤗 Datasets version, please contact gabriele.sarti996@gmail.com .

Licensing Information

The variance-aware test set were created based on the original WMT test set. Thus, the the original data licensing plan already stated by WMT organizers is still applicable:

The data released for the WMT news translation task can be freely used for research purposes, we just ask that you cite the WMT shared task overview paper, and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult with original owners of the data sets.

Citation Information

Please cite the authors if you use these corpora in your work. It is also advised to cite the original WMT shared task paper for the specific test sets that were used.

@inproceedings{
    zhan2021varianceaware,
    title={Variance-Aware Machine Translation Test Sets},
    author={Runzhe Zhan and Xuebo Liu and Derek F. Wong and Lidia S. Chao},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track},
    year={2021},
    url={https://openreview.net/forum?id=hhKA5k0oVy5}
}

作者:

gsarti

数据集大小:

204.92 KB