Dataset:

GEM/wiki_cat_sum

Dataset Card for GEM/wiki_cat_sum

Link to Main Data Card

You can find the main data card on the GEM website.

Dataset Summary

WikiCatSum is an English summarization dataset in three domains: animals, companies, and film. It provides multiple paragraphs of text paired with a summary of the paragraphs.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/wiki_cat_sum')

The data loader can be found here.

website

GitHub

paper

arXiv

authors

Laura Perez-Beltrachini, Yang Liu, Mirella Lapata (University of Edinburgh); Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer (Google Brain)

Dataset Overview

Where to find the Data and its Documentation

Webpage

GitHub

Download

Website

Paper

arXiv

BibTeX
@inproceedings{perez-beltrachini-etal-2019-generating,
    title = "Generating Summaries with Topic Templates and Structured Convolutional Decoders",
    author = "Perez-Beltrachini, Laura  and
      Liu, Yang  and
      Lapata, Mirella",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1504",
    doi = "10.18653/v1/P19-1504",
}
Contact Name

Laura Perez-Beltrachini

Contact Email

lperez@ed.ac.uk

Has a Leaderboard?

no

Languages and Intended Use

Multilingual?

no

Covered Languages

English

License

cc-by-sa-3.0: Creative Commons Attribution Share Alike 3.0 Unported

Intended Use

Research on multi-document abstractive summarisation.

Primary Task

Summarization

Communicative Goal

Summarise the most important facts of a given entity in the Film, Company, and Animal domains from a cluster of related documents.

Credit

Curation Organization Type(s)

industry, academic

Curation Organization(s)

Google Cloud Platform, University of Edinburgh

Dataset Creators

Laura Perez-Beltrachini, Yang Liu, Mirella Lapata (University of Edinburgh); Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer (Google Brain)

Funding

Google Cloud Platform, European Research Council

Who added the Dataset to GEM?

Ronald Cardenas (University of Edinburgh), Laura Perez-Beltrachini (University of Edinburgh)

Dataset Structure

Data Fields
  • id : ID of the data example
  • title : The title of the Wikipedia article
  • paragraphs : The ranked list of paragraphs from the set of crawled texts
  • summary : A list of summary sentences, each with its corresponding topic label
Example Instance

This is a truncated example from the animal setting:

{'gem_id': 'animal-train-1',
 'gem_parent_id': 'animal-train-1',
 'id': '2652',
 'paragraphs': ["lytrosis (hulst) of louisiana vernon antoine brou jr. 2005. southern lepidopterists' news, 27: 7 ., ..."],
 'references': ['lytrosis unitaria , the common lytrosis moth, is a species of moth of the geometridae family. it is found in north america, including arkansas, georgia, iowa , massachusetts, and wisconsin. the wingspan is about 50 mm. the larvae feed on rosa, crataegus, amelanchier, acer, quercus and viburnum species.'],
 'summary': {'text': ['lytrosis unitaria , the common lytrosis moth , is a species of moth of the geometridae family .',
   'it is found in north america , including arkansas , georgia , iowa , massachusetts , new hampshire , new jersey , new york , north carolina , ohio , oklahoma , ontario , pennsylvania , south carolina , tennessee , texas , virginia , west virginia and wisconsin .',
   'the wingspan is about 50 mm .',
   'the larvae feed on rosa , crataegus , amelanchier , acer , quercus and viburnum species . '],
  'topic': [29, 20, 9, 8]},
 'target': 'lytrosis unitaria , the common lytrosis moth, is a species of moth of the geometridae family. it is found in north america, including arkansas, georgia, iowa , massachusetts, and wisconsin. the wingspan is about 50 mm. the larvae feed on rosa, crataegus, amelanchier, acer, quercus and viburnum species.',
 'title': 'lytrosis unitaria'}
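The snippet below is a minimal sketch of how the summary field could be read, pairing each sentence with its topic label. It assumes that summary['text'] and summary['topic'] are aligned by position, as the field description suggests. The instance is a shortened, hand-copied excerpt of the example above, and the helper name sentences_with_topics is hypothetical, not part of the dataset or the data loader.

```python
# Shortened excerpt of the example instance above (sentences 1 and 3 only).
example = {
    "title": "lytrosis unitaria",
    "summary": {
        "text": [
            "lytrosis unitaria , the common lytrosis moth , is a species of moth of the geometridae family .",
            "the wingspan is about 50 mm .",
        ],
        "topic": [29, 9],
    },
}


def sentences_with_topics(instance):
    """Return (topic, sentence) pairs, assuming positional alignment."""
    summary = instance["summary"]
    texts, topics = summary["text"], summary["topic"]
    if len(texts) != len(topics):
        raise ValueError("summary text and topic lists differ in length")
    return [(topic, text.strip()) for topic, text in zip(topics, texts)]


pairs = sentences_with_topics(example)
for topic, sentence in pairs:
    print(f"[topic {topic}] {sentence}")
```

The topic values are integer IDs. Since there is a separate topic model per domain, the same ID is presumably not comparable across the animal, company, and film subsets.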
Data Splits

The train/validation/test splits contain 50,938/2,855/2,831 instances, respectively.

Splitting Criteria

The data was split i.i.d., i.e. instances were assigned uniformly at random to the training, validation, and test sets.
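As an illustration only, a uniform random split with the reported sizes could be sketched as follows. This is not the authors' splitting script: the function names, the seed, and the use of integer IDs in place of real instances are all assumptions made for the example.

```python
import random

# Split sizes as reported in this data card.
SPLIT_SIZES = {"train": 50_938, "validation": 2_855, "test": 2_831}
TOTAL = sum(SPLIT_SIZES.values())


def split_fractions(sizes):
    """Fraction of all instances that falls into each split."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}


def uniform_split(ids, sizes, seed=0):
    """Shuffle ids uniformly at random, then cut into consecutive chunks."""
    ids = list(ids)
    if len(ids) != sum(sizes.values()):
        raise ValueError("number of ids does not match the split sizes")
    random.Random(seed).shuffle(ids)  # hypothetical seed, for reproducibility
    splits, start = {}, 0
    for name, n in sizes.items():
        splits[name] = ids[start:start + n]
        start += n
    return splits


fractions = split_fractions(SPLIT_SIZES)
splits = uniform_split(range(TOTAL), SPLIT_SIZES)
for name in SPLIT_SIZES:
    print(f"{name}: {len(splits[name])} instances ({fractions[name]:.1%})")
```

With these sizes the split is roughly 90/5/5 over 56,624 instances in total.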

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

Evaluation of models' performance on noisy (document, summary) pairs and long inputs. Evaluate models' capabilities to generalise and mitigate biases.

Similar Datasets

no

Unique Language Coverage

no

Ability that the Dataset measures

Capabilities to generalise, mitigate biases, factual correctness.

GEM-Specific Curation

Modified for GEM?

yes

GEM Modifications

annotations added

Modification Details

We provide topic labels for summary sentences.

Additional Splits?

no

Getting Started with the Task

Pointers to Resources

And all references in these papers.

Previous Results

Previous Results

Measured Model Abilities

Capabilities to generalise, mitigate biases, factual correctness.

Metrics

ROUGE, BERT-Score, MoverScore, Other: Other Metrics

Other Metrics
  • Abstract/Copy
  • Factual accuracy based on the score of (Goodrich et al., 2019) and the relation extraction system of (Sorokin and Gurevych, 2017).
Proposed Evaluation

Human evaluation is based on question answering and ranking (content, fluency, and repetition).

Previous results available?

yes

Other Evaluation Approaches

Those listed above.

Relevant Previous Results

Generating Summaries with Topic Templates and Structured Convolutional Decoders https://arxiv.org/abs/1906.04687

Noisy Self-Knowledge Distillation for Text Summarization https://arxiv.org/abs/2009.07032

Dataset Curation

Original Curation

Original Curation Rationale

The dataset is a subset of the WikiSum (Liu et al., 2018) dataset focusing on summaries of entities in three domains (Film, Company, and Animal). It is a multi-document summarisation dataset in which the input-output pair for each entity is created as follows. The input is a set of paragraphs collected from i) documents in the Reference section of the entity's Wikipedia page and ii) documents from the top ten search results returned by the Google search engine when queried with the entity name. The output summary is the Wikipedia abstract for the entity.
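Because the input is a ranked list of paragraphs that can be very long, a model input is typically assembled by concatenating paragraphs in rank order and truncating. The sketch below shows one way to do this. The whitespace tokenisation, the default budget of 800 tokens, and the function name build_input are assumptions for illustration and are not taken from the dataset or the paper.

```python
def build_input(paragraphs, max_tokens=800):
    """Concatenate ranked paragraphs in order, keeping at most max_tokens
    whitespace-separated tokens (hypothetical budget)."""
    tokens = []
    for paragraph in paragraphs:
        remaining = max_tokens - len(tokens)
        if remaining <= 0:
            break  # budget exhausted; lower-ranked paragraphs are dropped
        tokens.extend(paragraph.split()[:remaining])
    return " ".join(tokens)


# Toy ranked paragraphs standing in for the crawled texts.
ranked = ["a b c", "d e", "f g h"]
truncated = build_input(ranked, max_tokens=4)
print(truncated)
```

Keeping the rank order means that truncation discards the lowest-ranked text first.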

Communicative Goal

Generate descriptive summaries for specific domains, in which certain topics are discussed, generally in a specific order.

Sourced from Different Sources

yes

Source Details

WikiSum (Liu et al., 2018)

Language Data

How was Language Data Obtained?

Other

Topics Covered

The dataset and task focus on summaries for entities in three domains: Company, Film, and Animal.

Data Validation

not validated

Data Preprocessing

Summary sentences are associated with a topic label. There is a topic model for each domain.

Was Data Filtered?

not filtered

Structured Annotations

Additional Annotations?

automatically created

Annotation Service?

no

Annotation Values

Each summary sentence was annotated with a topic label. There is a topic model for each of the three domains. The labels were used to guide a hierarchical decoder.

Any Quality Control?

validated by data curators

Quality Control Details

Manual inspection of a sample of topics assigned to sentences. The number of topics was selected based on the performance of the summarisation model.

Consent

Any Consent Policy?

no

Justification for Using the Data

The dataset is based on Wikipedia and on referenced and retrieved documents crawled from the Web.

Private Identifying Information (PII)

Contains PII?

unlikely

Any PII Identification?

no identification

Maintenance

Any Maintenance Plan?

no

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

no

Discussion of Biases

Any Documented Social Biases?

yes

Links and Summaries of Analysis Work

This dataset is based on Wikipedia, so bias analyses of other Wikipedia-based datasets potentially apply to WikiCatSum as well. For instance, see the analysis of the ToTTo dataset in [1].

[1] Automatic Construction of Evaluation Suites for Natural Language Generation Datasets https://openreview.net/forum?id=CSi1eu_2q96

Considerations for Using the Data

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

public domain

Copyright Restrictions on the Language Data

public domain

Known Technical Limitations