数据集:
wmt20_mlqe_task1
From the homepage: This shared task (part of WMT20) will build on its previous editions to further examine automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. As in previous years, we cover estimation at various levels. Important elements introduced this year include: a new task where sentences are annotated with Direct Assessment (DA) scores instead of labels based on post-editing; a new multilingual sentence-level dataset mainly from Wikipedia articles, where the source articles can be retrieved for document-wide context; the availability of NMT models to explore system-internal information for the task.
Task 1 uses Wikipedia data for 6 language pairs that includes high-resource English--German (En-De) and English--Chinese (En-Zh), medium-resource Romanian--English (Ro-En) and Estonian--English (Et-En), and low-resource Sinhalese--English (Si-En) and Nepalese--English (Ne-En), as well as a dataset with a combination of Wikipedia articles and Reddit articles for Russian-English (En-Ru). The datasets were collected by translating sentences sampled from source language articles using state-of-the-art NMT models built using the fairseq toolkit and annotated with Direct Assessment (DA) scores by professional translators. Each sentence was annotated following the FLORES setup, which presents a form of DA, where at least three professional translators rate each sentence from 0-100 according to the perceived translation quality. DA scores are standardised using the z-score by rater. Participating systems are required to score sentences according to z-standardised DA scores.
From the homepage:
Sentence-level submissions will be evaluated in terms of the Pearson's correlation metric for the DA prediction agains human DA (z-standardised mean DA score, i.e. z_mean). These are the official evaluation scripts . The evaluation will focus on multilingual systems, i.e. systems that are able to provide predictions for all languages in the Wikipedia domain. Therefore, average Pearson correlation across all these languages will be used to rank QE systems. We will also evaluate QE systems on a per-language basis for those interested in particular languages.
Eight languages are represented in this dataset:
An example looks like this:
{ 'segid': 123, 'translation': { 'en': 'José Ortega y Gasset visited Husserl at Freiburg in 1934.', 'de': '1934 besuchte José Ortega y Gasset Husserl in Freiburg.', }, 'scores': [100.0, 100.0, 100.0], 'mean': 100.0, 'z_scores': [0.9553316831588745, 1.552362322807312, 0.850531816482544], 'z_mean': 1.1194086074829102, 'model_score': -0.10244649648666382, 'doc_id': 'Edmund Husserl', 'nmt_output': '1934 besuchte José Ort@@ ega y G@@ asset Hus@@ ser@@ l in Freiburg .', 'word_probas': [-0.4458000063896179, -0.2745000123977661, -0.07199999690055847, -0.002300000051036477, -0.005900000222027302, -0.14579999446868896, -0.07500000298023224, -0.012400000356137753, -0.026900000870227814, -0.036400001496076584, -0.05299999937415123, -0.14990000426769257, -0.012400000356137753, -0.1145000010728836, -0.10999999940395355], }
There are 7 configurations in this dataset (one for each available language pair). Each configuration is composed of 7K examples for training, 1K for validation and 1K for test.
The original text is extracted from Wikipedia, Russian Reddit and Russian WikiQuotes. Translations are obtained using state-of-the-art NMT models built using the fairseq toolkit and annotated with Direct Assesment scores by professional translators.
[More Information Needed]
Initial Data Collection and Normalization[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Annotation process[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Unknown
Not available.
Thanks to @VictorSanh for adding this dataset.