Dataset:
GEM/wiki_lingua
You can find the main data card on the GEM website.
You can load the dataset via:
    import datasets
    data = datasets.load_dataset('GEM/wiki_lingua')
The data loader can be found here.
Website: None (See Repository)
Paper: https://www.aclweb.org/anthology/2020.findings-emnlp.360/
Authors: Faisal Ladhak (Columbia University), Esin Durmus (Stanford University), Claire Cardie (Cornell University), Kathleen McKeown (Columbia University)
None (See Repository)
Download: https://github.com/esdurmus/Wikilingua
Paper: https://www.aclweb.org/anthology/2020.findings-emnlp.360/
BibTex:
@inproceedings{ladhak-etal-2020-wikilingua,
    title = "{W}iki{L}ingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization",
    author = "Ladhak, Faisal and Durmus, Esin and Cardie, Claire and McKeown, Kathleen",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.360",
    doi = "10.18653/v1/2020.findings-emnlp.360",
    pages = "4034--4048",
    abstract = "We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost efficient during inference.",
}
Contact Name: Faisal Ladhak, Esin Durmus
Contact Email: faisal@cs.columbia.edu, esdurmus@stanford.edu
Has a Leaderboard? no
yes
Covered Dialects: The dataset does not have multiple dialects per language.
Covered Languages: English, Spanish (Castilian), Portuguese, French, German, Russian, Italian, Indonesian, Dutch (Flemish), Arabic, Chinese, Vietnamese, Thai, Japanese, Korean, Hindi, Czech, Turkish
Whose Language? No information about the user demographic is available.
License: cc-by-nc-sa-3.0: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Intended Use: The dataset was intended to serve as a large-scale, high-quality benchmark dataset for cross-lingual summarization.
Primary Task: Summarization
Communicative Goal: Produce a high-quality summary for the given input article.
academic
Curation Organization(s): Columbia University
Dataset Creators: Faisal Ladhak (Columbia University), Esin Durmus (Stanford University), Claire Cardie (Cornell University), Kathleen McKeown (Columbia University)
Who added the Dataset to GEM? Jenny Chim (Queen Mary University of London), Faisal Ladhak (Columbia University)
gem_id -- The id for the data instance.
source_language -- The language of the source article.
target_language -- The language of the target summary.
source -- The source document.
Example Instance:
    {
        "gem_id": "wikilingua_crosslingual-train-12345",
        "gem_parent_id": "wikilingua_crosslingual-train-12345",
        "source_language": "fr",
        "target_language": "de",
        "source": "Document in fr",
        "target": "Summary in de"
    }
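The fields shown in the example instance can be checked with a small typed wrapper. This is a minimal sketch that uses only the six keys visible in the example above; the class name `WikiLinguaInstance` and the helper `parse_instance` are hypothetical and are not part of the GEM data loader.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WikiLinguaInstance:
    """Hypothetical typed view of one record, using the keys from the example instance."""
    gem_id: str
    gem_parent_id: str
    source_language: str
    target_language: str
    source: str
    target: str

    @property
    def is_cross_lingual(self) -> bool:
        # A record is cross-lingual when the summary language differs from the article language.
        return self.source_language != self.target_language


def parse_instance(record: dict) -> WikiLinguaInstance:
    """Build a typed instance; raises KeyError if an expected field is missing."""
    return WikiLinguaInstance(**{f.name: record[f.name] for f in fields(WikiLinguaInstance)})


example = {
    "gem_id": "wikilingua_crosslingual-train-12345",
    "gem_parent_id": "wikilingua_crosslingual-train-12345",
    "source_language": "fr",
    "target_language": "de",
    "source": "Document in fr",
    "target": "Summary in de",
}
instance = parse_instance(example)
print(instance.source_language, "->", instance.target_language, instance.is_cross_lingual)
```

Failing early on a missing key makes it easier to notice when a record does not match the documented schema.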
Data Splits: The data is split into train/dev/test. In addition to the full test set, there is also a sampled version of the test set.
Splitting Criteria: The data was split so that the same document appears in the same split across languages, which prevents leakage into the test set.
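One common way to keep every language version of a document in the same split is to derive the split from a language-independent document identifier. The card does not say how the authors implemented their split, so the following is only a sketch of the idea: the `doc_id` key, the `assign_split` helper, and the 10%/10% dev/test fractions are all assumptions made for illustration.

```python
import hashlib


def assign_split(doc_id: str, dev_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a language-independent document id to a split, deterministically."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"


# Hypothetical records: the same how-to article in several languages shares one doc_id.
records = [
    {"doc_id": "howto-001", "language": "en"},
    {"doc_id": "howto-001", "language": "fr"},
    {"doc_id": "howto-001", "language": "de"},
    {"doc_id": "howto-002", "language": "en"},
    {"doc_id": "howto-002", "language": "es"},
]

splits_by_doc = {}
for record in records:
    splits_by_doc.setdefault(record["doc_id"], set()).add(assign_split(record["doc_id"]))

# Every document lands in exactly one split, regardless of language.
print(all(len(splits) == 1 for splits in splits_by_doc.values()))
```

Because the split depends only on the document id, adding a new language version of an existing article can never move it into a different split.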
This dataset provides a large-scale, high-quality resource for cross-lingual summarization in 18 languages, increasing the coverage of languages for the GEM summarization task.
Similar Datasets: yes
Unique Language Coverage: yes
Difference from other GEM datasets: XSum covers English news articles, and MLSum covers news articles in German and Spanish. In contrast, this dataset has how-to articles in 18 languages, substantially increasing the languages covered. It also covers a different domain than the other two datasets.
Ability that the Dataset measures: The ability to generate quality summaries across multiple languages.
yes
GEM Modifications: other
Modification Details: The previous version had separate data loaders for each language. This version has a single monolingual data loader, which contains monolingual data in each of the 18 languages, and a single cross-lingual data loader across all the language pairs in the dataset.
Additional Splits? no
Ability to summarize content across different languages.
Metrics: ROUGE
Proposed Evaluation: ROUGE is used to measure content selection by comparing word overlap with reference summaries. In addition, the authors of the dataset also used human evaluation to evaluate content selection and fluency of the systems.
Previous results available? no
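The word-overlap idea behind ROUGE can be illustrated with a simplified ROUGE-1 F1 score. This is a sketch only, not the evaluation script used by the dataset authors or by GEM: it assumes lowercased whitespace tokenization, which is not adequate for languages such as Chinese, Japanese, or Thai, and it omits stemming and the ROUGE-2 and ROUGE-L variants. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between a candidate and one reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat", "the cat sat on the mat"))  # 0.5
```

In the usage line, both candidate words appear in the reference (precision 1.0), but they cover only two of its six words (recall 1/3), which gives an F1 of 0.5. For reported results, an established ROUGE implementation with language-appropriate tokenization is the safer choice.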
The dataset was created to enable new approaches for cross-lingual and multilingual summarization, which are currently understudied, and to open up interesting new directions for research in summarization. Examples include multi-source cross-lingual architectures, i.e., models that can summarize from multiple source languages into a target language, and models that can summarize articles from any language into any other language for a given set of languages.
Communicative Goal: Given an input article, produce a high-quality summary of the article in the target language.
Sourced from Different Sources: no
Found
Where was it found? Single website
Language Producers: WikiHow, an online resource of how-to guides written and reviewed by human authors, is used as the data source.
Topics Covered: The articles cover 19 broad categories including health, arts and entertainment, personal care and style, travel, and education and communications. The categories cover a broad set of genres and topics.
Data Validation: not validated
Was Data Filtered? not filtered
none
Annotation Service? no
yes
Consent Policy Details: (1) Text Content. All text posted by Users to the Service is sub-licensed by wikiHow to other Users under a Creative Commons license as provided herein. The Creative Commons license allows such text content be used freely for non-commercial purposes, so long as it is used and attributed to the original author as specified under the terms of the license. Allowing free republication of our articles helps wikiHow achieve its mission by providing instruction on solving the problems of everyday life to more people for free. In order to support this goal, wikiHow hereby grants each User of the Service a license to all text content that Users contribute to the Service under the terms and conditions of a Creative Commons CC BY-NC-SA 3.0 License. Please be sure to read the terms of the license carefully. You continue to own all right, title, and interest in and to your User Content, and you are free to distribute it as you wish, whether for commercial or non-commercial purposes.
Other Consented Downstream Use: The data is made freely available under the Creative Commons license, so there are no restrictions on downstream uses as long as they are non-commercial.
no PII
Justification for no PII: Only the article text and summaries were collected. No user information was retained in the dataset.
no
yes - other datasets featuring the same task
no
yes
non-commercial use only
Copyright Restrictions on the Language Data: non-commercial use only