Dataset: asset
Sub-tasks: text-simplification
Languages: en
Multilinguality: monolingual
Size categories: 1K<n<10K
Language creators: found
Annotation creators: crowdsourced
License: cc-by-nc-4.0

ASSET (Alva-Manchego et al., 2020) is a multi-reference dataset for the evaluation of sentence simplification in English. The dataset uses the same 2,359 sentences as TurkCorpus (Xu et al., 2016), and each sentence is associated with 10 crowdsourced simplifications. Unlike previous simplification datasets, which contain a single type of transformation (e.g., lexical paraphrasing in TurkCorpus or sentence splitting in HSplit), the simplifications in ASSET encompass a variety of rewriting transformations.
The dataset supports the evaluation of text-simplification systems. Success in this task is typically measured using the SARI and FKBLEU metrics described in the paper Optimizing Statistical Machine Translation for Text Simplification (Xu et al., 2016).
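As a minimal sketch of how SARI can be computed, assuming the Hugging Face `evaluate` library and its `sari` metric (the second reference below is invented for illustration; a real evaluation would pass all 10 ASSET references per sentence):

```python
# Minimal sketch: scoring a simplification against ASSET references with SARI,
# using the Hugging Face `evaluate` library (pip install evaluate sacrebleu).
import evaluate

sari = evaluate.load("sari")

sources = ["He settled in London, devoting himself chiefly to practical teaching."]
predictions = ["He lived in London. He was a teacher."]
# ASSET provides 10 references per source sentence; only two are shown here,
# and the second one is invented for illustration.
references = [[
    "He lived in London. He was a teacher.",
    "He moved to London and focused on teaching.",
]]

print(sari.compute(sources=sources, predictions=predictions, references=references))
# -> {'sari': <score between 0 and 100>}
```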
The text in this dataset is in English (`en`).
ASSET does not contain a training set; many models use WikiLarge (Zhang and Lapata, 2017) for training.
Each input sentence has 10 associated reference simplified sentences. The statistics of ASSET are given below.
|                           | Dev    | Test  | Total  |
|---------------------------|--------|-------|--------|
| Input Sentences           | 2,000  | 359   | 2,359  |
| Reference Simplifications | 20,000 | 3,590 | 23,590 |
The test and validation sets are the same as those of TurkCorpus. The split was random.
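A sketch of loading the dataset, assuming it is hosted on the Hugging Face Hub under the `asset` identifier with a `simplification` configuration and `original`/`simplifications` fields:

```python
# Minimal sketch: loading ASSET with the `datasets` library (pip install datasets).
# Assumes the Hub identifier "asset", the "simplification" configuration, and the
# "original"/"simplifications" field names from the Hub dataset card.
from datasets import load_dataset

asset = load_dataset("asset", "simplification")

print(asset["validation"].num_rows, asset["test"].num_rows)  # 2000 359

example = asset["validation"][0]
print(example["original"])               # one input sentence
print(len(example["simplifications"]))   # 10 crowdsourced references
```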
There are 19.04 tokens per reference on average (lower than 21.29 and 25.49 for TurkCorpus and HSplit, respectively). Most (17,245) of the reference sentences do not involve sentence splitting.
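These figures can be approximated from the data itself. The sketch below uses naive whitespace tokenization and a crude one-sentence heuristic, so the exact numbers will differ slightly from the paper's:

```python
# Sketch: approximating the reference statistics reported above.
# Uses naive whitespace tokenization and a crude "contains a sentence break"
# heuristic, so the exact counts will differ slightly from the paper's.
from datasets import load_dataset

asset = load_dataset("asset", "simplification")

n_refs = n_tokens = n_no_split = 0
for split in ("validation", "test"):
    for example in asset[split]:
        for ref in example["simplifications"]:
            n_refs += 1
            n_tokens += len(ref.split())
            if ". " not in ref:  # treat a single sentence as "no splitting"
                n_no_split += 1

print(f"avg tokens per reference: {n_tokens / n_refs:.2f}")  # ~19
print(f"references without splitting: {n_no_split}")         # ~17,245
```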
ASSET was created in order to improve the evaluation of sentence simplification. It uses the same input sentences as the TurkCorpus dataset (Xu et al., 2016). The 2,359 input sentences of TurkCorpus are a sample of "standard" (not simple) sentences from the Parallel Wikipedia Simplification (PWKP) dataset (Zhu et al., 2010), which come from the August 22, 2009 version of Wikipedia. The sentences of TurkCorpus were chosen to be of similar length (Xu et al., 2016). No further information is provided on the sampling strategy.
The TurkCorpus dataset was developed in order to overcome some of the problems with sentence pairs from Standard and Simple Wikipedia: a large fraction of sentences were misaligned, or not actually simpler (Xu et al., 2016). However, TurkCorpus mainly focused on lexical paraphrasing, and so cannot be used to evaluate simplifications involving compression (deletion) or sentence splitting. HSplit (Sulem et al., 2018), on the other hand, can only be used to evaluate sentence splitting. The reference sentences in ASSET include a wider variety of sentence rewriting strategies, combining splitting, compression and paraphrasing. Annotators were given examples of each kind of transformation individually, as well as all three transformations used at once, but were allowed to decide which transformations to use for any given sentence.
An example illustrating the differences between TurkCorpus, HSplit and ASSET is given below:
Original: He settled in London, devoting himself chiefly to practical teaching.
TurkCorpus: He rooted in London, devoting himself mainly to practical teaching.
HSplit: He settled in London. He devoted himself chiefly to practical teaching.
ASSET: He lived in London. He was a teacher.
[More Information Needed]
Who are the source language producers?

The input sentences are from English Wikipedia (August 22, 2009 version). No demographic information is available for the writers of these sentences. However, most Wikipedia editors are male (Lam, 2011; Graells-Garrido, 2015), which has an impact on the topics covered (see also the Wikipedia page on Wikipedia gender bias). In addition, Wikipedia editors are mostly white, young, and from the Northern Hemisphere (Wikipedia: Systemic bias).
Reference sentences were written by 42 workers on Amazon Mechanical Turk (AMT) who met the qualification requirements described in the ASSET paper.
No other demographic or compensation information is provided in the ASSET paper.
The instructions given to the annotators are available here.
Who are the annotators?

[More Information Needed]
[More Information Needed]
[More Information Needed]
The dataset may contain some social biases, as the input sentences are based on Wikipedia. Studies have shown that the English Wikipedia contains both gender biases (Schmahl et al., 2020) and racial biases (Adams et al., 2019).
Adams, Julia, Hannah Brückner, and Cambria Naslund. "Who Counts as a Notable Sociologist on Wikipedia? Gender, Race, and the 'Professor Test'." Socius 5 (2019): 2378023118823946.

Schmahl, Katja Geertruida, et al. "Is Wikipedia Succeeding in Reducing Gender Bias? Assessing Changes in Gender Bias in Wikipedia Using Word Embeddings." Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science, 2020.
Dataset provided for research purposes only. Please check dataset license for additional information.
ASSET was developed by researchers at the University of Sheffield, Inria, Facebook AI Research, and Imperial College London. The work was partly supported by Benoît Sagot's chair in the PRAIRIE institute, funded by the French National Research Agency (ANR) as part of the "Investissements d’avenir" program (reference ANR-19-P3IA-0001).
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
```
@inproceedings{alva-manchego-etal-2020-asset,
    title = "{ASSET}: {A} Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations",
    author = "Alva-Manchego, Fernando and
      Martin, Louis and
      Bordes, Antoine and
      Scarton, Carolina and
      Sagot, Beno{\^\i}t and
      Specia, Lucia",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.424",
    pages = "4668--4679",
}
```
This dataset card uses material written by Juan Diego Rodriguez.
Thanks to @yjernite for adding this dataset.