Dataset: GEM/wiki_cat_sum
Tasks: Summarization
Languages: en
Multilinguality: unknown
Language Creators: unknown
Annotations Creators: automatically-created
Source Datasets: original
License: cc-by-sa-3.0
You can find the main data card on the GEM Website.
WikiCatSum is an English summarization dataset in three domains: animals, companies, and film. It provides multiple paragraphs of text paired with a summary of the paragraphs.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/wiki_cat_sum')
The data loader can be found here.
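As a hedged sketch of what working with the loaded data might look like, the helpers below group records by domain. They assume that every `gem_id` follows the `<domain>-<split>-<index>` pattern seen in this card's example (`animal-train-1`), and that a split named `validation` exists; neither is confirmed by the card. The names `domain_of`, `group_by_domain`, and `load_split` are illustrative, not part of the dataset or of the `datasets` library.

```python
from collections import defaultdict


def domain_of(gem_id: str) -> str:
    """Return the domain prefix of a gem_id such as 'animal-train-1'.

    Assumption: ids follow the '<domain>-<split>-<index>' pattern seen in
    the card's example record.
    """
    return gem_id.split("-", 1)[0]


def group_by_domain(examples):
    """Group records (dicts with a 'gem_id' key) by their domain prefix."""
    groups = defaultdict(list)
    for example in examples:
        groups[domain_of(example["gem_id"])].append(example)
    return dict(groups)


def load_split(split: str = "validation"):
    """Load one split using the dataset name given in this card.

    The import is local so the helpers above work without the `datasets`
    package installed; the split name is an assumption.
    """
    import datasets

    return datasets.load_dataset("GEM/wiki_cat_sum", split=split)
```

Calling `group_by_domain(load_split())` would then give one list of records per domain, provided the loader is available and the id pattern holds.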
Authors: Laura Perez-Beltrachini, Yang Liu, Mirella Lapata (University of Edinburgh); Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer (Google Brain)
@inproceedings{perez-beltrachini-etal-2019-generating,
    title = "Generating Summaries with Topic Templates and Structured Convolutional Decoders",
    author = "Perez-Beltrachini, Laura and Liu, Yang and Lapata, Mirella",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1504",
    doi = "10.18653/v1/P19-1504",
}
Contact Name: Laura Perez-Beltrachini
Contact Email: lperez@ed.ac.uk
Has a Leaderboard? no
no
Covered Languages: English
License: cc-by-sa-3.0 (Creative Commons Attribution Share Alike 3.0 Unported)
Intended Use: Research on multi-document abstractive summarisation.
Primary Task: Summarization
Communicative Goal: Summarise the most important facts of a given entity in the Film, Company, and Animal domains from a cluster of related documents.
industry, academic
Curation Organization(s): Google Cloud Platform, University of Edinburgh
Dataset Creators: Laura Perez-Beltrachini, Yang Liu, Mirella Lapata (University of Edinburgh); Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer (Google Brain)
Funding: Google Cloud Platform, European Research Council
Who added the Dataset to GEM? Ronald Cardenas (University of Edinburgh), Laura Perez-Beltrachini (University of Edinburgh)
This is a truncated example from the animal setting:
{
    'gem_id': 'animal-train-1',
    'gem_parent_id': 'animal-train-1',
    'id': '2652',
    'paragraphs': ["lytrosis (hulst) of louisiana vernon antoine brou jr. 2005. southern lepidopterists' news, 27: 7 ., ..."],
    'references': ['lytrosis unitaria , the common lytrosis moth, is a species of moth of the geometridae family. it is found in north america, including arkansas, georgia, iowa , massachusetts, and wisconsin. the wingspan is about 50 mm. the larvae feed on rosa, crataegus, amelanchier, acer, quercus and viburnum species.'],
    'summary': {
        'text': [
            'lytrosis unitaria , the common lytrosis moth , is a species of moth of the geometridae family .',
            'it is found in north america , including arkansas , georgia , iowa , massachusetts , new hampshire , new jersey , new york , north carolina , ohio , oklahoma , ontario , pennsylvania , south carolina , tennessee , texas , virginia , west virginia and wisconsin .',
            'the wingspan is about 50 mm .',
            'the larvae feed on rosa , crataegus , amelanchier , acer , quercus and viburnum species . '
        ],
        'topic': [29, 20, 9, 8]
    },
    'target': 'lytrosis unitaria , the common lytrosis moth, is a species of moth of the geometridae family. it is found in north america, including arkansas, georgia, iowa , massachusetts, and wisconsin. the wingspan is about 50 mm. the larvae feed on rosa, crataegus, amelanchier, acer, quercus and viburnum species.',
    'title': 'lytrosis unitaria'
}
Data Splits
The train/validation/test splits contain 50,938/2,855/2,831 instances, respectively.
Splitting Criteria: The data was split i.i.d., i.e. uniformly at random into training, validation, and test sets.
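To put the split sizes in proportion, the short sketch below computes each split's share of the total. The counts are the ones reported above; the names `SPLIT_SIZES` and `split_fractions` are illustrative only.

```python
# Split sizes as reported in this card.
SPLIT_SIZES = {"train": 50_938, "validation": 2_855, "test": 2_831}


def split_fractions(sizes):
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_instances = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

# Print one line per split with its size and percentage share.
for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} instances ({share:.1%})")
```

With these counts the total is 56,624 instances, which works out to roughly a 90/5/5 split.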
Evaluation of models' performance on noisy (document, summary) pairs and long inputs. Evaluate models' capabilities to generalise and mitigate biases.
Similar Datasets: no
Unique Language Coverage: no
Ability that the Dataset measures: Capabilities to generalise, mitigate biases, factual correctness.
yes
GEM Modifications: annotations added
Modification Details: We provide topic labels for summary sentences.
Additional Splits? no
And all references in these papers.
Capabilities to generalise, mitigate biases, factual correctness.
Metrics: ROUGE, BERT-Score, MoverScore, Other: Other Metrics
Other Metrics: Human-based evaluations are question answering and ranking (content, fluency, and repetition).
Previous results available? yes
Other Evaluation Approaches: Those listed above.
Relevant Previous Results: Generating Summaries with Topic Templates and Structured Convolutional Decoders https://arxiv.org/abs/1906.04687
Noisy Self-Knowledge Distillation for Text Summarization https://arxiv.org/abs/2009.07032
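ROUGE is listed among the metrics above. As a rough illustration of the idea behind it, the sketch below computes a simplified unigram-overlap F1 (in the spirit of ROUGE-1) on whitespace tokens. It is not the official ROUGE implementation and is not necessarily how the cited papers computed their scores: it does no stemming and no stopword handling. The function name `rouge1_f1` and the candidate sentence are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 between whitespace-tokenised strings."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as in either text.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Reference sentence taken from the card's example; the candidate is a
# made-up model output used only for illustration.
reference = "the wingspan is about 50 mm ."
candidate = "the wingspan is 50 mm ."
score = rouge1_f1(candidate, reference)
print(f"simplified ROUGE-1 F1: {score:.3f}")
```

For reported results, an established ROUGE package would be the safer choice, since tokenisation and stemming choices change the scores.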
The dataset is a subset of the WikiSum (Liu et al., 2018) dataset focusing on summaries of entities in three domains (Film, Company, and Animal). It is a multi-document summarisation dataset in which the input-output pair for each entity is created as follows. The input is a set of paragraphs collected from i) documents in the Reference section of the entity's Wikipedia page plus ii) documents collected from the top ten search results after querying the Google search engine with the entity name. The output summary is the Wikipedia abstract for the entity.
Communicative Goal: Generate descriptive summaries for specific domains, where certain topics are discussed, generally in a specific order.
Sourced from Different Sources: yes
Source Details: WikiSum (Liu et al., 2018)
Other
Topics Covered: The dataset and task focus on summaries for entities in three domains: Company, Film, and Animal.
Data Validation: not validated
Data Preprocessing: Summary sentences are associated with a topic label. There is a topic model for each domain.
Was Data Filtered? not filtered
automatically created
Annotation Service? no
Annotation Values: Each summary sentence was annotated with a topic label. There is a topic model for each of the three domains. This was used to guide a hierarchical decoder.
Any Quality Control? validated by data curators
Quality Control Details: Manual inspection of a sample of topics assigned to sentences. The number of topics was selected based on the performance of the summarisation model.
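The topic annotations can be read alongside the summary sentences. The sketch below uses the `summary` field from this card's animal example and pairs each sentence with its label. It assumes that `topic[i]` is the label for `text[i]`, which the equal list lengths in the example suggest but the card does not state outright. The helper name `pair_sentences_with_topics` is illustrative.

```python
from collections import Counter

# The 'summary' field copied from the card's animal example.
summary = {
    "text": [
        "lytrosis unitaria , the common lytrosis moth , is a species of moth of the geometridae family .",
        "it is found in north america , including arkansas , georgia , iowa , massachusetts , new hampshire , new jersey , new york , north carolina , ohio , oklahoma , ontario , pennsylvania , south carolina , tennessee , texas , virginia , west virginia and wisconsin .",
        "the wingspan is about 50 mm .",
        "the larvae feed on rosa , crataegus , amelanchier , acer , quercus and viburnum species . ",
    ],
    "topic": [29, 20, 9, 8],
}


def pair_sentences_with_topics(summary):
    """Return (topic, sentence) pairs, assuming the two lists are aligned."""
    sentences, topics = summary["text"], summary["topic"]
    if len(sentences) != len(topics):
        raise ValueError("expected one topic label per summary sentence")
    return [(topic, sentence.strip()) for topic, sentence in zip(topics, sentences)]


pairs = pair_sentences_with_topics(summary)
# Count how often each topic id occurs in this summary.
topic_counts = Counter(topic for topic, _ in pairs)

for topic, sentence in pairs:
    print(f"topic {topic}: {sentence[:60]}")
```

The topic ids are indices into a per-domain topic model, so an id from the animal domain is not comparable with the same id in the film or company domains.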
no
Justification for Using the Data: The dataset is based on Wikipedia and on referenced and retrieved documents crawled from the Web.
unlikely
Any PII Identification? no identification
no
no
no
yes
Links and Summaries of Analysis Work: This dataset is based on Wikipedia, so bias analyses of other Wikipedia-based datasets potentially apply to WikiCatSum as well. For instance, see the analysis of the ToTTo dataset in [1].
[1] Automatic Construction of Evaluation Suites for Natural Language Generation Datasets. https://openreview.net/forum?id=CSi1eu_2q96
public domain
Copyright Restrictions on the Language Data: public domain