Dataset: juletxara/pawsx_mt
Task: text classification
Language: en
Multilinguality: multilingual
Size: 10K<n<100K
Source dataset: extended|other-paws
Preprint: arxiv:1908.11828
License: other
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
For further details, see the accompanying paper: PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
This is a version of the original dataset machine-translated into English from each language.
It has mainly been used for paraphrase identification in English and six other languages: French, Spanish, German, Chinese, Japanese, and Korean.
The dataset is in English, French, Spanish, German, Chinese, Japanese, and Korean.
For en:
id: 1
sentence1: In Paris , in October 1560 , he secretly met the English ambassador , Nicolas Throckmorton , asking him for a passport to return to England through Scotland .
sentence2: In October 1560 , he secretly met with the English ambassador , Nicolas Throckmorton , in Paris , and asked him for a passport to return to Scotland through England .
label: 0
For fr:
id: 1
sentence1: À Paris, en octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, lui demandant un passeport pour retourner en Angleterre en passant par l'Écosse.
sentence2: En octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, à Paris, et lui demanda un passeport pour retourner en Écosse par l'Angleterre.
label: 0
All files are in tsv format with four columns:
Column Name | Data |
---|---|
id | An ID that matches the ID of the source pair in PAWS-Wiki |
sentence1 | The first sentence |
sentence2 | The second sentence |
label | Label for each pair (1 = paraphrase, 0 = not a paraphrase) |
The source text of each translation can be retrieved by looking up the ID in the corresponding file in PAWS-Wiki.
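As a minimal sketch of reading the four-column layout described above, the snippet below parses tab-separated rows with Python's standard `csv` module. The in-memory `SAMPLE_TSV` string, its sentences, and the `read_pairs` helper are hypothetical stand-ins for illustration, and the presence of a header row is an assumption that should be checked against the actual files.

```python
import csv
import io

# Hypothetical sample; a real file would be opened with
# open(path, encoding="utf-8", newline="") instead.
# Assumption: the first row is a header naming the four columns.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tFirst sentence .\tSecond sentence .\t0\n"
    "2\tAnother \"quoted\" sentence .\tA paraphrase of it .\t1\n"
)


def read_pairs(handle):
    """Yield one dict per row, converting id and label to int."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "id": int(row["id"]),
            "sentence1": row["sentence1"],
            "sentence2": row["sentence2"],
            "label": int(row["label"]),
        }


pairs = list(read_pairs(io.StringIO(SAMPLE_TSV)))
print(len(pairs), pairs[0]["label"])
```

Disabling quote handling is a deliberate choice here: the sentences are natural text and may contain unbalanced quotation marks, which the default CSV dialect would otherwise try to interpret.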
The numbers of examples for each of the seven languages are shown below:
Language | Train | Dev | Test |
---|---|---|---|
en | 49,401 | 2,000 | 2,000 |
fr | 49,401 | 2,000 | 2,000 |
es | 49,401 | 2,000 | 2,000 |
de | 49,401 | 2,000 | 2,000 |
zh | 49,401 | 2,000 | 2,000 |
ja | 49,401 | 2,000 | 2,000 |
ko | 49,401 | 2,000 | 2,000 |
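The training counts in the table can be reconciled with the 296,406 machine translated training pairs quoted earlier: that total covers the six translated languages and excludes English. A short check, with variable names chosen here purely for illustration:

```python
# Per-language training size, taken from the table above.
TRAIN_PAIRS_PER_LANGUAGE = 49_401

# The six translated languages (English is the source language).
TRANSLATED_LANGUAGES = ["fr", "es", "de", "zh", "ja", "ko"]

machine_translated_total = TRAIN_PAIRS_PER_LANGUAGE * len(TRANSLATED_LANGUAGES)
print(machine_translated_total)  # 296406
```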
Caveat: please note that the dev and test sets of PAWS-X are both sourced from the dev set of PAWS-Wiki. As a consequence, the same sentence 1 may appear in both the dev and test sets. Nevertheless, our data split guarantees that there is no overlap on sentence pairs (sentence 1 + sentence 2) between dev and test.
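The distinction in this caveat, between sharing a first sentence and sharing a whole pair, can be checked with two set intersections. The sketch below is only an illustration: the helper names and the toy rows are hypothetical, and rows are assumed to be dicts with `sentence1` and `sentence2` keys.

```python
def shared_first_sentences(dev_rows, test_rows):
    """Return the sentence1 values that occur in both splits."""
    dev_first = {row["sentence1"] for row in dev_rows}
    test_first = {row["sentence1"] for row in test_rows}
    return dev_first & test_first


def shared_pairs(dev_rows, test_rows):
    """Return the (sentence1, sentence2) pairs that occur in both splits."""
    dev_pairs = {(row["sentence1"], row["sentence2"]) for row in dev_rows}
    test_pairs = {(row["sentence1"], row["sentence2"]) for row in test_rows}
    return dev_pairs & test_pairs


# Toy rows: the first sentence is shared, the full pair is not.
dev = [{"sentence1": "A met B in Paris .", "sentence2": "B met A in Paris ."}]
test = [{"sentence1": "A met B in Paris .", "sentence2": "In Paris , A met B ."}]

print(shared_first_sentences(dev, test))
print(shared_pairs(dev, test))
```

On the real splits, the first set may be non-empty while the second is expected to be empty, according to the guarantee stated above.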
Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) (Zhang et al., 2019) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. The PAWS-X authors remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. They provide baseline numbers for three models with different capacity to capture non-local context and sentence structure, using different multilingual training and evaluation regimes. Multilingual BERT (Devlin et al., 2019) fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.
PAWS (Paraphrase Adversaries from Word Scrambling)
Initial Data Collection and Normalization
All translated pairs are sourced from examples in PAWS-Wiki.
Who are the source language producers?
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean.
Who are the annotators?
The paper mentions the translate team, especially Mengmeng Niu, for the help with the annotations.
The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}
Thanks to @bhavitvyamalik and @gowtham1997 for adding this dataset.