Dataset: turk
Tasks: Text2Text Generation
Sub-tasks: text-simplification
Languages: en
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: found
Annotation creators: machine-generated
Source datasets: original
License: gpl-3.0

TURK is a multi-reference dataset for evaluating sentence simplification in English. The dataset consists of 2,359 sentences from the Parallel Wikipedia Simplification (PWKP) corpus. Each sentence is associated with 8 crowdsourced simplifications that focus only on lexical paraphrasing (no sentence splitting or deletion).
There is no leaderboard for this task.
TURK contains English text only (BCP-47: en).
An instance consists of an original sentence and 8 possible reference simplifications that focus on lexical paraphrasing.
{'original': 'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat region in sudan .', 'simplifications': ['one side of the armed conflicts is made of sudanese military and the janjaweed , a sudanese militia recruited from the afro-arab abbala tribes of the northern rizeigat region in sudan .', 'one side of the armed conflicts consist of the sudanese military and the sudanese militia group janjaweed .', 'one side of the armed conflicts is mainly sudanese military and the janjaweed , which recruited from the afro-arab abbala tribes .', 'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes in sudan .', 'one side of the armed conflicts is made up mostly of the sudanese military and the janjaweed , a sudanese militia group whose recruits mostly come from the afro-arab abbala tribes from the northern rizeigat region in sudan .', 'the sudanese military and the janjaweed make up one of the armed conflicts , mostly from the afro-arab abbal tribes in sudan .', 'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat regime in sudan .', 'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat region in sudan .']}
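The instance layout above can be checked mechanically. The following is a minimal sketch, assuming each instance is a plain Python dict with the two keys shown; the function name `validate_instance` and the toy sentences are hypothetical and not part of the dataset or any library.

```python
EXPECTED_REFERENCES = 8  # each input sentence has 8 reference simplifications


def validate_instance(instance: dict) -> None:
    """Raise ValueError if `instance` does not match the layout described above."""
    original = instance.get("original")
    if not isinstance(original, str) or not original.strip():
        raise ValueError("'original' must be a non-empty string")

    references = instance.get("simplifications")
    if not isinstance(references, list):
        raise ValueError("'simplifications' must be a list")
    if len(references) != EXPECTED_REFERENCES:
        raise ValueError(
            f"expected {EXPECTED_REFERENCES} references, got {len(references)}"
        )
    if not all(isinstance(ref, str) and ref.strip() for ref in references):
        raise ValueError("every reference must be a non-empty string")


# Toy instance (hypothetical text, not taken from the dataset).
toy = {
    "original": "the cat sat on the mat .",
    "simplifications": [f"the cat sat on mat {i} ." for i in range(8)],
}
validate_instance(toy)  # passes silently when the layout is as expected
```

A check like this is mainly useful when instances have been re-serialized or filtered, since a dropped reference would otherwise go unnoticed.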
TURK does not contain a training set; many models use WikiLarge (Zhang and Lapata, 2017) or Wiki-Auto (Jiang et al., 2020) for training.
Each input sentence has 8 associated reference simplifications. The 2,359 input sentences are randomly split into 2,000 validation and 359 test sentences.
| | Dev | Test | Total |
|---|---|---|---|
| Input Sentences | 2000 | 359 | 2359 |
| Reference Simplifications | 16000 | 2872 | 18872 |
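The reference counts in the table follow directly from the split sizes and the 8 references per sentence. A short sketch of that arithmetic (the variable names are illustrative only):

```python
REFERENCES_PER_SENTENCE = 8

# Split sizes as stated in the text.
split_sizes = {"validation": 2000, "test": 359}

# Every input sentence contributes the same number of references.
reference_counts = {
    name: n * REFERENCES_PER_SENTENCE for name, n in split_sizes.items()
}

total_sentences = sum(split_sizes.values())
total_references = sum(reference_counts.values())

print(reference_counts)   # {'validation': 16000, 'test': 2872}
print(total_sentences)    # 2359
print(total_references)   # 18872
```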
The TURK dataset was constructed to evaluate the task of text simplification. It contains multiple human-written references that focus only on lexical simplification.
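To illustrate how multiple references can be used at evaluation time, here is a minimal sketch that scores a system output against every reference and keeps the best match. Unigram F1 is used purely as a simple stand-in; it is an assumption made for this example and not the metric used by the dataset authors. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between two whitespace-tokenized sentences."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_reference_score(candidate: str, references: List[str]) -> float:
    """Score a candidate against its closest reference (0.0 if there are none)."""
    return max((unigram_f1(candidate, ref) for ref in references), default=0.0)


# Toy example (hypothetical sentences, not taken from the dataset).
score = best_reference_score("a b", ["a c", "x y"])
print(score)  # 0.5
```

Taking the maximum rewards an output that matches any one acceptable simplification; averaging over references is an equally plausible design choice that instead rewards agreement with the annotators as a group.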
The input sentences in the dataset are extracted from the Parallel Wikipedia Simplification (PWKP) corpus.
Who are the source language producers?
The references are crowdsourced from Amazon Mechanical Turk. The annotators were asked to provide simplifications without losing any information or splitting the input sentence. No other demographic or compensation information is provided in the paper.
The instructions given to the annotators are available in the paper.
Who are the annotators?
The annotators are Amazon Mechanical Turk workers.
Since the dataset is created from English Wikipedia (August 22, 2009 version), all the information contained in the dataset is already in the public domain.
The dataset advances research on text simplification by providing higher-quality validation and test sets. Progress in text simplification in turn has the potential to make written documents accessible to wider audiences.
The dataset may contain some social biases, as the input sentences are based on Wikipedia. Studies have shown that the English Wikipedia contains both gender biases (Schmahl et al., 2020) and racial biases (Adams et al., 2019).
Since the dataset contains only 2,359 sentences that are derived from Wikipedia, it is limited to a small subset of topics present on Wikipedia.
TURK was developed by researchers at the University of Pennsylvania. The work was supported by the NSF under grant IIS-1430651 and the NSF GRFP under grant 1232825.
GNU General Public License v3.0
@article{Xu-EtAl:2016:TACL,
  author = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch},
  title = {Optimizing Statistical Machine Translation for Text Simplification},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {4},
  year = {2016},
  url = {https://cocoxu.github.io/publications/tacl2016-smt-simplification.pdf},
  pages = {401--415}
}
Thanks to @mounicam for adding this dataset.