中文

Dataset Card for PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

Dataset Summary

This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki .

For further details, see the accompanying paper: PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

Supported Tasks and Leaderboards

It has been majorly used for paraphrase identification for English and other 6 languages namely French, Spanish, German, Chinese, Japanese, and Korean

Languages

The dataset is in English, French, Spanish, German, Chinese, Japanese, and Korean

Dataset Structure

Data Instances

For en:

id		    :   1
sentence1	:	In Paris , in October 1560 , he secretly met the English ambassador , Nicolas Throckmorton , asking him for a passport to return to England through Scotland .
sentence2	:	In October 1560 , he secretly met with the English ambassador , Nicolas Throckmorton , in Paris , and asked him for a passport to return to Scotland through England .
label       :   0

For fr:

id		    :   1
sentence1	:	À Paris, en octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, lui demandant un passeport pour retourner en Angleterre en passant par l'Écosse.
sentence2	:	En octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, à Paris, et lui demanda un passeport pour retourner en Écosse par l'Angleterre.
label       :   0

Data Fields

All files are in tsv format with four columns:

Column Name Data
id An ID that matches the ID of the source pair in PAWS-Wiki
sentence1 The first sentence
sentence2 The second sentence
label Label for each pair

The source text of each translation can be retrieved by looking up the ID in the corresponding file in PAWS-Wiki.

Data Splits

The numbers of examples for each of the seven languages are shown below:

Language Train Dev Test
en 49,401 2,000 2,000
fr 49,401 2,000 2,000
es 49,401 2,000 2,000
de 49,401 2,000 2,000
zh 49,401 2,000 2,000
ja 49,401 2,000 2,000
ko 49,401 2,000 2,000

Caveat : please note that the dev and test sets of PAWS-X are both sourced from the dev set of PAWS-Wiki. As a consequence, the same sentence 1 may appear in both the dev and test sets. Nevertheless our data split guarantees that there is no overlap on sentence pairs ( sentence 1 + sentence 2 ) between dev and test.

Dataset Creation

Curation Rationale

Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) (Zhang et al., 2019) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. They remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. They provide baseline numbers for three models with different capacity to capture non-local context and sentence structure, and using different multilingual training and evaluation regimes. Multilingual BERT (Devlin et al., 2019) fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.

Source Data

PAWS (Paraphrase Adversaries from Word Scrambling)

Initial Data Collection and Normalization

All translated pairs are sourced from examples in PAWS-Wiki

Who are the source language producers?

This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean.

Annotations

Annotation process

If applicable, describe the annotation process and any tools used, or state otherwise. Describe the amount of data annotated, if not all. Describe or reference annotation guidelines provided to the annotators. If available, provide interannotator statistics. Describe any annotation validation processes.

Who are the annotators?

The paper mentions the translate team, especially Mengmeng Niu, for the help with the annotations.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.

Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Citation Information

@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}

Contributions

Thanks to @bhavitvyamalik , @gowtham1997 for adding this dataset.