You can find the main data card on the GEM Website.
XSum is an English news summarization dataset where the task is to predict the first sentence of an article from the rest of it.
You can load the dataset via:

```python
import datasets

data = datasets.load_dataset('GEM/xsum')
```

The data loader can be found here.
Website: n/a

Paper Authors: Shashi Narayan, Shay B. Cohen, Mirella Lapata (all affiliated with the University of Edinburgh at the time of dataset creation)
```bibtex
@InProceedings{xsum-emnlp,
  author    = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
  title     = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year      = "2018",
  address   = "Brussels, Belgium",
}
```

Contact Name: Shashi Narayan
Contact Email: shashinarayan@google.com

Has a Leaderboard? no
no
Covered Dialects: Since the dataset is sourced from BBC articles, the language is British English as written by professional journalists.

Covered Languages: English

Whose Language? Professional journalists

License: cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use: The dataset is for abstractive summarization in its extreme form: summarizing a document in a single sentence. The idea is to create a short, one-sentence news summary answering the question "What is the article about?".

Primary Task: Summarization

Communicative Goal: Given a news article, produce a single-sentence summary of its content.
academic
Curation Organization(s): University of Edinburgh

Dataset Creators: Shashi Narayan, Shay B. Cohen, Mirella Lapata (all affiliated with the University of Edinburgh at the time of dataset creation)

Funding: European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen).
Who added the Dataset to GEM? The original data card was written by Laura Perez-Beltrachini and the data loader by Yacine Jernite. Sebastian Gehrmann migrated the data card to the new format and extended it. The v2 data loader was migrated by Abinaya Mahendiran.
The Document/Summary format is standard for summarization datasets.
How were labels chosen? The labels are the first sentence of the source article.
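The first-sentence labeling described above can be sketched as follows. This is an illustration, not the curators' published code; the helper name `make_pair` and the one-sentence-per-line layout (mirroring the released data) are assumptions.

```python
# Illustrative sketch (not the curators' code): derive a (document, target)
# pair by splitting off the article's first sentence as the summary target.
# Assumes one sentence per line, as in the released articles.

def make_pair(article: str) -> dict:
    first, _, rest = article.partition("\n")
    return {"target": first.strip(), "document": rest.strip()}

article = (
    "A team of UK scientists hopes to shed light on a tree disease.\n"
    "The researchers have sequenced the genome of a bacterium.\n"
    "The findings have been published in a journal."
)
pair = make_pair(article)
print(pair["target"])
```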
Example Instance:

```
{
  'document': 'The researchers have sequenced the genome of a strain of bacterium that causes the virulent infection.\nA survey in 2007 showed that bleeding canker had spread rapidly, with almost half of the two million horse chestnuts displaying symptoms of the disease.\nThe findings have been published in the journal PLoS One.\nA visible symptom of the disease is a lesion on the bark, which oozes a resin on to the trunk or sometimes the branches.\nThe bark underneath the canker is killed, and if cankers manage to go all the way around the trunk then the horse chestnut (Aesculus hippocastanum) will die because it cuts off the food supply. [...]',
  'target': "A team of UK scientists hopes to shed light on the mysteries of bleeding canker, a disease that is threatening the nation's horse chestnut trees.",
}
```

Data Splits
| Section | Number of Documents |
| --- | --- |
| Training | 204,045 |
| Validation | 11,332 |
| Testing | 11,334 |
| Total | 226,711 |
| Section | Average number of words | Average number of sentences |
| --- | --- | --- |
| Documents | 431.07 | 19.77 |
| Summaries | 23.26 | 1.00 |
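As a rough illustration, per-text averages like those in the table above can be computed as follows. The toy inputs and the newline-as-sentence-boundary convention (matching the one-sentence-per-line layout of the released articles) are assumptions of the sketch.

```python
# Sketch of computing average words and sentences per text.
# Sentence boundaries are taken to be newlines (an assumption that mirrors
# the one-sentence-per-line layout of the released XSum articles).

def avg_stats(texts):
    n = len(texts)
    avg_words = sum(len(t.split()) for t in texts) / n
    avg_sents = sum(len(t.splitlines()) for t in texts) / n
    return avg_words, avg_sents

docs = [
    "one two three\nfour five",        # 5 words, 2 sentences
    "six seven\neight nine ten\nend",  # 6 words, 3 sentences
]
words, sents = avg_stats(docs)
print(words, sents)  # 5.5 2.5
```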
The identifiers in the URLs were used to randomly split the dataset into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.
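A quick arithmetic check confirms that the reported split sizes are consistent with the stated 90/5/5 percentages:

```python
# Sanity check: the split sizes reported above should sum to the total
# and match the stated 90% / 5% / 5% proportions.
splits = {"train": 204_045, "validation": 11_332, "test": 11_334}
total = sum(splits.values())
shares = {k: round(100 * v / total, 1) for k, v in splits.items()}
print(total, shares)  # 226711 {'train': 90.0, 'validation': 5.0, 'test': 5.0}
```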
Comparable datasets are often very extractive, a strategy that does not work for one-sentence summaries. The dataset curators thus created this dataset to evaluate truly abstractive models.
Communicative Goal: Same as the communicative goal in GEM: a model should summarize a news article in a single sentence.
Sourced from Different Sources? no
Found
Where was it found? Single website
Language Producers: The data was collected from articles published between 2010 and 2017. No other information is available.

Topics Covered: The collected articles included the following topics: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, and Entertainment and Arts.

The dataset curators also used LDA to gain insight into this question and identified the top keywords associated with each topic.
not validated
Data Preprocessing: The text was extracted from the HTML of the webpage. No further processing was done.

Was Data Filtered? not filtered
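The HTML-to-text extraction step can be sketched with the standard library. The class name `TextExtractor` and the simple concatenation of text nodes are assumptions for illustration; the actual pipeline used for XSum is not published in this card.

```python
# Minimal sketch of HTML-to-text extraction of the kind described above,
# using only Python's standard library. Not the actual XSum pipeline.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collect non-empty text nodes, dropping surrounding whitespace.
        text = data.strip()
        if text:
            self.chunks.append(text)

html = "<html><body><p>The findings were published.</p><p>Symptoms spread.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print("\n".join(parser.chunks))
```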
none
Annotation Service?no
no
Justification for Using the Data: The copyright license of the data allows reusing it for this purpose.
yes/very likely
Categories of PII: generic PII

Any PII Identification? no identification
no
no
no
unsure
Are the Language Producers Representative of the Language? The language and content of the data are focused on news in the UK and as such are not representative of English speakers worldwide. Existing selection biases of the BBC are present in this dataset.