Datasets:

GEM/xsum

Languages:

en

Multilinguality:

unknown

Language creators:

unknown

Annotations creators:

none

Source datasets:

original

Dataset Card for GEM/xsum

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

XSum is an English news summarization dataset where the task is to predict the first sentence of an article from the rest of it.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/xsum')

The data loader can be found here.
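
As a hedged sketch of a first step after loading: the object returned by `load_dataset` behaves like a dictionary that maps split names to datasets, so a small helper can report how many examples each split holds. The helper name `split_sizes` and the stand-in dictionary are illustrative assumptions; the stand-in only mimics the shape of the real object so that the snippet runs without downloading anything.

```python
from typing import Dict, Mapping, Sized


def split_sizes(data: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split of a dict-like dataset."""
    return {name: len(split) for name, split in data.items()}


# Stand-in for the real object (hypothetical toy data, not actual XSum content).
toy_data = {
    "train": [{"document": "a b c", "target": "a"}] * 3,
    "validation": [{"document": "d e f", "target": "d"}],
}
sizes = split_sizes(toy_data)
print(sizes)

# With the real dataset (requires network access and the `datasets` library):
#   import datasets
#   data = datasets.load_dataset('GEM/xsum')
#   print(split_sizes(data))
```

The same helper should work on the loaded dataset because it relies only on `.items()` and `len()`.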

website

n/a

paper

ACL Anthology

authors

Shashi Narayan, Shay B. Cohen, Mirella Lapata (all affiliated with University of Edinburgh at the time of dataset creation)

Dataset Overview

Where to find the Data and its Documentation

Download

Github

Paper

ACL Anthology

BibTex
@InProceedings{xsum-emnlp,
  author =      "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
  title =       "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
  booktitle =   "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year =        "2018",
  address =     "Brussels, Belgium",
}
Contact Name

Shashi Narayan

Contact Email

shashinarayan@google.com

Has a Leaderboard?

no

Languages and Intended Use

Multilingual?

no

Covered Dialects

Since the dataset is sourced from BBC articles, the language is British English as written by journalists.

Covered Languages

English

Whose Language?

Professional journalists

License

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

The dataset targets abstractive summarization in its extreme form: summarizing a document in a single sentence. The idea is to create a short, one-sentence news summary answering the question "What is the article about?".

Primary Task

Summarization

Communicative Goal

Given a news article, produce a single-sentence summary of the content of the article.

Credit

Curation Organization Type(s)

academic

Curation Organization(s)

University of Edinburgh

Dataset Creators

Shashi Narayan, Shay B. Cohen, Mirella Lapata (all affiliated with University of Edinburgh at the time of dataset creation)

Funding

European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen).

Who added the Dataset to GEM?

The original data card was written by Laura Perez-Beltrachini and the data loader by Yacine Jernite. Sebastian Gehrmann migrated the data card to the new format and extended it. The v2 data loader was migrated by Abinaya Mahendiran.

Dataset Structure

Data Fields

  • Document : Input news article.
  • Summary : One-sentence summary of the article.
  • Id : BBC ID of the article.

Reason for Structure

The Document/Summary format is standard for summarization datasets.

How were labels chosen?

The labels are the first sentence of the source article.
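
The labeling rule can be illustrated with a minimal sketch that peels the first sentence off a text and treats it as the summary, leaving the rest as the document. The function name `split_first_sentence`, the regular-expression sentence boundary, and the toy article are assumptions for illustration; the dataset creators' actual extraction from the BBC pages may have worked differently.

```python
import re
from typing import Tuple

# Naive boundary (assumption): sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_first_sentence(article: str) -> Tuple[str, str]:
    """Split an article into (first sentence, remaining text)."""
    parts = _SENTENCE_END.split(article.strip(), maxsplit=1)
    summary = parts[0]
    document = parts[1] if len(parts) > 1 else ""
    return summary, document


# Made-up article text for illustration only.
article = (
    "A team of scientists hopes to shed light on a tree disease. "
    "The researchers have sequenced the genome of a bacterium. "
    "The findings have been published in a journal."
)
summary, document = split_first_sentence(article)
print("summary: ", summary)
print("document:", document)
```

A rule this simple would misfire on abbreviations such as "Dr." or "U.K.", which is one reason to treat it as a sketch only.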

Example Instance
{
  'document': 'The researchers have sequenced the genome of a strain of bacterium that causes the virulent infection.\nA survey in 2007 showed that bleeding canker had spread rapidly, with almost half of the two million horse chestnuts displaying symptoms of the disease.\nThe findings have been published in the journal PLoS One.\nA visible symptom of the disease is a lesion on the bark, which oozes a resin on to the trunk or sometimes the branches.\nThe bark underneath the canker is killed, and if cankers manage to go all the way around the trunk then the horse chestnut (Aesculus hippocastanum) will die because it cuts off the food supply. [...]',
  'target': "A team of UK scientists hopes to shed light on the mysteries of bleeding canker, a disease that is threatening the nation's horse chestnut trees.",
}
Data Splits

Section      Number of Documents
Training     204,045
Validation   11,332
Testing      11,334
Total        226,711

Section      Average number of words   Average number of sentences
Documents    431.07                    19.77
Summary      23.26                     1.00

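
Length statistics like the per-section word and sentence figures above can be computed in a few lines. The sketch below assumes whitespace tokenization and a naive punctuation-based sentence count; the card does not state which tools produced the reported numbers, so results from this code would not be expected to match them exactly. The function name `length_stats` and the toy summaries are hypothetical.

```python
import re
from statistics import mean
from typing import Iterable, Tuple


def length_stats(texts: Iterable[str]) -> Tuple[float, float]:
    """Return (average words, average sentences) over a collection of texts."""
    texts = list(texts)
    # Assumption: words are whitespace-separated tokens.
    word_counts = [len(t.split()) for t in texts]
    # Assumption: a sentence is a run of text ending in '.', '!' or '?'.
    sentence_counts = [len(re.findall(r"[^.!?]+[.!?]", t)) for t in texts]
    return mean(word_counts), mean(sentence_counts)


# Made-up summaries for illustration only.
toy_summaries = [
    "Scientists hope to explain a tree disease.",
    "A council has approved new housing plans.",
]
avg_words, avg_sentences = length_stats(toy_summaries)
print(f"words: {avg_words:.2f}, sentences: {avg_sentences:.2f}")
```
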
Splitting Criteria

The identifiers in the URLs were used to randomly split the dataset into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.
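
To make the proportions concrete, here is a minimal sketch of a reproducible 90/5/5 split over article identifiers, followed by a check of the shares implied by the reported counts. The seeded shuffle, the function name `split_ids`, and the synthetic identifiers are all assumptions; the card states only that the URL identifiers were used for a random split, not how the split was implemented.

```python
import random
from typing import Dict, List, Sequence


def split_ids(ids: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle identifiers with a fixed seed and cut them 90/5/5."""
    shuffled = sorted(ids)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


# Synthetic identifiers for illustration only (not real BBC IDs).
toy_ids = [str(10_000_000 + i) for i in range(1000)]
splits = split_ids(toy_ids)
print({name: len(part) for name, part in splits.items()})

# Shares implied by the counts reported in this card.
counts = {"train": 204_045, "validation": 11_332, "test": 11_334}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)
```

Fixing the seed and sorting the identifiers first keeps the assignment stable across runs, which matters when a split has to be regenerated later.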

Dataset Curation

Original Curation

Original Curation Rationale

Comparable datasets are often very extractive, which is not a strategy that works for one-sentence summaries. The dataset curators thus created this dataset as a way to evaluate truly abstractive models.
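
One common way to quantify how extractive a summary is, offered here as an illustrative sketch rather than the curators' own procedure, is the share of summary n-grams that never appear in the source document. The function names, the lowercase whitespace tokenization, and the toy texts are assumptions.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(document: str, summary: str, n: int = 1) -> float:
    """Fraction of distinct summary n-grams that do not occur in the document."""
    doc_ngrams = ngrams(document.lower().split(), n)
    sum_ngrams = ngrams(summary.lower().split(), n)
    if not sum_ngrams:
        return 0.0
    return len(sum_ngrams - doc_ngrams) / len(sum_ngrams)


# Made-up texts for illustration only.
document = "the council approved the housing plan on monday"
extractive = "the council approved the housing plan"
abstractive = "local officials backed new homes"

print(novel_ngram_fraction(document, extractive))
print(novel_ngram_fraction(document, abstractive))
```

A summary copied from the document scores near zero, while a freely reworded one scores near one.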

Communicative Goal

Same as the communicative goal in GEM: a model should summarize a news article in a single sentence.

Sourced from Different Sources

no

Language Data

How was Language Data Obtained?

Found

Where was it found?

Single website

Language Producers

The data was collected from articles published between 2010 and 2017. No other information is available.

Topics Covered

The collected articles included the following topics: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts.

The dataset curators also used LDA to gain insight into this question and found that the following were the top keywords associated with each topic:

  • T1 : charge, court, murder, police, arrest, guilty, sentence, boy, bail, space, crown, trial
  • T2 : church, abuse, bishop, child, catholic, gay, pope, school, christian, priest, cardinal
  • T3 : council, people, government, local, housing, home, house, property, city, plan, authority
  • T4 : clinton, party, trump, climate, poll, vote, plaid, election, debate, change, candidate, campaign
  • T5 : country, growth, report, business, export, fall, bank, security, economy, rise, global, inflation
  • T6 : hospital, patient, trust, nhs, people, care, health, service, staff, report, review, system, child
Data Validation

not validated

Data Preprocessing

The text was extracted from the HTML of the webpage. No further processing was done.

Was Data Filtered?

not filtered

Structured Annotations

Additional Annotations?

none

Annotation Service?

no

Consent

Any Consent Policy?

no

Justification for Using the Data

The copyright license of the data allows reusing it for this purpose.

Private Identifying Information (PII)

Contains PII?

yes/very likely

Categories of PII

generic PII

Any PII Identification?

no identification

Maintenance

Any Maintenance Plan?

no

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

no

Discussion of Biases

Any Documented Social Biases?

unsure

Are the Language Producers Representative of the Language?

The language and content of the data focus on news and language in the UK and are therefore not representative of English speakers worldwide. The BBC's existing selection biases carry over into this dataset.