Dataset:
refresd
The Rationalized English-French Semantic Divergences (REFreSD) dataset consists of 1,039 English-French sentence-pairs annotated with sentence-level divergence judgments and token-level rationales. The project under which REFreSD was collected aims to advance our fundamental understanding of computational representations and methods for comparing and contrasting text meaning across languages.
semantic-similarity-classification and semantic-similarity-scoring: This dataset can be used to assess the ability of computational methods to detect meaning mismatches between languages. Model performance is measured as accuracy, by comparing model predictions with the human judgments in REFreSD. Details about the results of a BERT-based model, Divergent mBERT, on this dataset can be found in the paper.
The text is in English and French as found on Wikipedia. The associated BCP-47 codes are en and fr.
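The accuracy measure described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `accuracy` and the toy label lists are assumptions made for the example.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], gold_labels: Sequence[int]) -> float:
    """Fraction of sentence pairs whose predicted label equals the human label."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Hypothetical labels for five sentence pairs (not taken from REFreSD).
gold = [0, 1, 1, 0, 1]
pred = [0, 1, 0, 0, 1]
print(accuracy(pred, gold))  # 4 of the 5 predictions match
```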
Each data point looks like this:
{ 'sentence_pair': {'en': 'The invention of farming some 10,000 years ago led to the development of agrarian societies , whether nomadic or peasant , the latter in particular almost always dominated by a strong sense of traditionalism .', 'fr': "En quelques décennies , l' activité économique de la vallée est passée d' une mono-activité agricole essentiellement vivrière , à une quasi mono-activité touristique , si l' on excepte un artisanat du bâtiment traditionnel important , en partie saisonnier ."}, 'label': 0, 'all_labels': 0, 'rationale_en': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'rationale_fr': [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3], }
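A record of this shape can be inspected by pairing each token with its rationale value. The sketch below assumes that the sentences are pre-tokenized with single spaces and that each rationale list has one entry per whitespace-separated token. The helper `pair_tokens_with_rationale` and the toy record are hypothetical, and the code does not interpret the integer values.

```python
from typing import List, Tuple


def pair_tokens_with_rationale(sentence: str, rationale: List[int]) -> List[Tuple[str, int]]:
    """Pair whitespace-separated tokens with their rationale values.

    Assumption: one rationale entry per whitespace token.
    """
    tokens = sentence.split()
    if len(tokens) != len(rationale):
        raise ValueError(
            f"{len(tokens)} tokens but {len(rationale)} rationale values"
        )
    return list(zip(tokens, rationale))


# Hypothetical record with the same keys as the example above (not real data).
toy_record = {
    "sentence_pair": {
        "en": "She moved to Paris .",
        "fr": "Elle a déménagé à Paris en 2010 .",
    },
    "rationale_en": [0, 0, 0, 0, 0],
    "rationale_fr": [0, 0, 0, 0, 0, 2, 2, 0],
}

for lang in ("en", "fr"):
    pairs = pair_tokens_with_rationale(
        toy_record["sentence_pair"][lang], toy_record[f"rationale_{lang}"]
    )
    print(lang, pairs)
```

The length check is deliberate: a mismatch signals that the whitespace-tokenization assumption does not hold for that record.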
The dataset contains 1039 sentence pairs in a single "train" split. Of these pairs, 64% are annotated as divergent, and 40% contain fine-grained meaning divergences.
Label | Number of Instances |
---|---|
Unrelated | 252 |
Some meaning difference | 418 |
No meaning difference | 369 |
The curators chose the English-French section of the WikiMatrix corpus because (1) it is likely to contain diverse, interesting divergence types, since it consists of mined parallel sentences on diverse topics that are not necessarily generated by (human) translation, and (2) Wikipedia and WikiMatrix are widely used resources for training semantic representations and performing cross-lingual transfer in NLP.
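
The percentages quoted for the split can be checked against the counts in the table. The sketch assumes that "divergent" covers both the "Unrelated" and "Some meaning difference" classes, and that "fine-grained meaning divergences" corresponds to "Some meaning difference". That reading is consistent with the reported figures but is an interpretation.

```python
# Class counts copied from the table above.
counts = {
    "Unrelated": 252,
    "Some meaning difference": 418,
    "No meaning difference": 369,
}

total = sum(counts.values())

# Assumption: "divergent" = any pair that is not "No meaning difference".
divergent = counts["Unrelated"] + counts["Some meaning difference"]
fine_grained = counts["Some meaning difference"]

print(total)                                 # 1039
print(round(100 * divergent / total, 1))     # 64.5
print(round(100 * fine_grained / total, 1))  # 40.2
```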
The source for this corpus is the English and French portion of the WikiMatrix corpus , which itself was extracted from Wikipedia articles. The curators excluded noisy samples by filtering out sentence pairs that a) were too short or too long, b) consisted mostly of numbers, or c) had a small token-level edit difference.
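The three filtering criteria can be sketched as a single predicate. This is an illustration only, not the curators' actual filter: the function names, the regular expression, and every threshold (`min_len`, `max_len`, `max_numeric_ratio`, `min_edit_distance`) are hypothetical, because the text above does not state the values used.

```python
import re
from typing import List

# A token counts as numeric if it is digits with optional separators, e.g. "10,000".
NUMERIC = re.compile(r"\d[\d.,]*")


def token_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance computed over tokens instead of characters."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_noisy(
    en: str,
    fr: str,
    min_len: int = 5,                # hypothetical threshold
    max_len: int = 80,               # hypothetical threshold
    max_numeric_ratio: float = 0.5,  # hypothetical threshold
    min_edit_distance: int = 3,      # hypothetical threshold
) -> bool:
    """Return True if the pair matches any of the three noise criteria."""
    en_tokens, fr_tokens = en.split(), fr.split()
    for tokens in (en_tokens, fr_tokens):
        # (a) too short or too long
        if not tokens or not (min_len <= len(tokens) <= max_len):
            return True
        # (b) consists mostly of numbers
        numeric = sum(bool(NUMERIC.fullmatch(t)) for t in tokens)
        if numeric / len(tokens) > max_numeric_ratio:
            return True
    # (c) the two sides are nearly identical token sequences
    return token_edit_distance(en_tokens, fr_tokens) < min_edit_distance


print(is_noisy("The river flows through the old town .",
               "La rivière traverse la vieille ville ."))
print(is_noisy("Paris is the capital of France .",
               "Paris is the capital of France !"))
```

Criterion (c) is read here as removing pairs whose two sides are almost the same string, for example text copied without translation.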
Who are the source language producers?
Some content of Wikipedia articles has been (human) translated from existing articles in another language while others have been written or edited independently in each language. Therefore, information on how the original text is created is not available.
The annotations were collected over the span of three weeks in April 2020. Annotators were presented with an English sentence and a French sentence. First, they highlighted spans and labeled them as 'added', 'changed', or 'other', where added spans contain information not contained in the other sentence, changed spans contain some information that is in the other sentence but whose meaning is not the same, and other spans have some different meaning not covered in the previous two cases, such as idioms. They then assessed the relation between the two sentences as either 'unrelated', 'some meaning differences', or 'no meaning difference'. See the annotation guidelines for more information about the task and the annotation interface, and see the DataSheet for information about the annotator compensation.
The following table contains Inter-Annotator Agreement metrics for the dataset:
Granularity | Method | IAA |
---|---|---|
Sentence | Krippendorff's α | 0.60 |
Span | macro F1 | 45.56 ± 7.60 |
Token | macro F1 | 33.94 ± 8.24 |
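
One plausible way to obtain a macro F1 agreement score with a spread is to compute macro F1 for every pair of annotators and report the mean and standard deviation over pairs. The sketch below shows that computation at the token level. It is an assumption that the reported figures were aggregated this way, and the annotator IDs and label names in the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def macro_f1(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 between two aligned label sequences."""
    labels = sorted(set(a) | set(b))
    scores = []
    for label in labels:
        tp = sum(x == label and y == label for x, y in zip(a, b))
        fp = sum(x != label and y == label for x, y in zip(a, b))
        fn = sum(x == label and y != label for x, y in zip(a, b))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def pairwise_agreement(annotations: Dict[str, List[str]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of macro F1 over all annotator pairs."""
    scores = [
        macro_f1(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    ]
    return mean(scores), stdev(scores)


# Hypothetical token labels from three annotators for one four-token sentence.
toy = {
    "A": ["none", "added", "added", "none"],
    "B": ["none", "added", "none", "none"],
    "C": ["none", "added", "added", "none"],
}
avg, spread = pairwise_agreement(toy)
print(f"{100 * avg:.2f} ± {100 * spread:.2f}")
```

Because F1 is symmetric in its two arguments, it does not matter which annotator in a pair is treated as the reference.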
This dataset includes annotations from 6 participants recruited from the University of Maryland, College Park (UMD) educational institution. Participants ranged in age from 20–25 years, including one man and five women. For each participant, the curators ensured they were proficient in both languages of interest: three of them self-reported as English native speakers, one as a French native speaker, and two as bilingual English-French speakers.
The dataset contains discussions of people as they appear in Wikipedia articles. It does not contain confidential information, nor does it contain identifying information about the source language producers or the annotators.
Models that are successful in the supported task require sophisticated semantic representations at the sentence level beyond the combined representations of the individual tokens in isolation. Such models could be used to curate parallel corpora for tasks like machine translation, cross-lingual transfer learning, or semantic modeling.
The statements in the dataset, however, are not necessarily representative of the world and may overrepresent one worldview if text is predominantly translated into one of the two languages rather than translated in both directions in roughly equal measure.
The English Wikipedia is known to have significantly more contributors who identify as male than any other gender and who reside in either North America or Europe. This leads to an overrepresentation of male perspectives from these locations in the corpus in terms of both the topics covered and the language used to talk about those topics. It's not clear to what degree this holds true for the French Wikipedia. The REFreSD dataset itself has not yet been examined for the degree to which it contains the gender and other biases seen in the larger Wikipedia datasets.
It is unknown how many of the sentences in the dataset were written independently, and how many were written as translations by either humans or machines from some other language to the languages of interest in this dataset.
The dataset curators are Eleftheria Briakou and Marine Carpuat, who are both affiliated with the University of Maryland, College Park's Department of Computer Science.
The project is licensed under the MIT License.
@inproceedings{briakou-carpuat-2020-detecting,
    title = "Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank",
    author = "Briakou, Eleftheria and Carpuat, Marine",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.121",
    pages = "1563--1580",
}
Thanks to @mpariente and @mcmillanmajora for adding this dataset.