Dataset: GEM/xlsum
Tasks: Summarization
Languages: language:und
Multilinguality: unknown
Language creators: unknown
Annotations creators: none
Source datasets: original
ArXiv: arxiv:1607.01759
License: cc-by-nc-sa-4.0
You can find the main data card on the GEM Website.
XLSum is a highly multilingual summarization dataset supporting 44 languages. The data stems from BBC news articles.
You can load the dataset via:
    import datasets
    data = datasets.load_dataset('GEM/xlsum')
The data loader can be found here.
@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}
Contact Name: Tahmid Hasan
Contact Email: tahmidhasan@cse.buet.ac.bd
Has a Leaderboard? yes
Leaderboard Details: The leaderboard ranks models based on ROUGE scores (R1/R2/RL) of the generated summaries.
yes
Covered Languages: Amharic; Arabic; Azerbaijani; Bengali, Bangla; Burmese; Chinese (family); English; French; Gujarati; Hausa; Hindi; Igbo; Indonesian; Japanese; Rundi; Korean; Kirghiz, Kyrgyz; Marathi; Nepali (individual language); Oromo; Pushto, Pashto; Persian; Ghanaian Pidgin English; Portuguese; Panjabi, Punjabi; Russian; Scottish Gaelic, Gaelic; Serbian; Romano-Serbian; Sinhala, Sinhalese; Somali; Spanish, Castilian; Swahili (individual language), Kiswahili; Tamil; Telugu; Thai; Tigrinya; Turkish; Ukrainian; Urdu; Uzbek; Vietnamese; Welsh; Yoruba
License: cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International
Intended Use: Abstractive summarization has centered around the English language, as most large abstractive summarization datasets are available in English only. Though there have been some recent efforts for curating multilingual abstractive summarization datasets, they are limited in terms of the number of languages covered, the number of training samples, or both. To this end, XL-Sum presents a large-scale abstractive summarization dataset of 1.35 million news articles from 45 languages crawled from the British Broadcasting Corporation website. It is intended to be used for both multilingual and per-language summarization tasks.
Primary Task: Summarization
Communicative Goal: Summarize news-like text in one of 45 languages.
academic
Curation Organization(s): Bangladesh University of Engineering and Technology
Who added the Dataset to GEM? Tahmid Hasan (Bangladesh University of Engineering and Technology), Abhik Bhattacharjee (Bangladesh University of Engineering and Technology)
{
  "gem_id": "GEM-xlsum_english-train-1589",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}
Data Splits
The splits in the dataset are specified by the language names, which are as follows:
We used an 80%-10%-10% split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that its evaluation set size resembles that of CNN/DM and XSum. Scottish Gaelic, Kyrgyz and Sinhala had relatively few samples, so their evaluation sets were increased to 500 samples each for more reliable evaluation. The same articles were used for evaluation in the two variants of Chinese and Serbian to prevent data leakage in multilingual training. Train-dev-test example counts for each language are given below:
Language | ISO 639-1 Code | BBC subdomain(s) | Train | Dev | Test | Total |
---|---|---|---|---|---|---|
Amharic | am | BBC amharic | 5761 | 719 | 719 | 7199 |
Arabic | ar | BBC arabic | 37519 | 4689 | 4689 | 46897 |
Azerbaijani | az | BBC azeri | 6478 | 809 | 809 | 8096 |
Bengali | bn | BBC bengali | 8102 | 1012 | 1012 | 10126 |
Burmese | my | BBC burmese | 4569 | 570 | 570 | 5709 |
Chinese (Simplified) | zh-CN | BBC ukchina/simp, BBC zhongwen/simp | 37362 | 4670 | 4670 | 46702 |
Chinese (Traditional) | zh-TW | BBC ukchina/trad, BBC zhongwen/trad | 37373 | 4670 | 4670 | 46713 |
English | en | BBC english, BBC sinhala* | 306522 | 11535 | 11535 | 329592 |
French | fr | BBC afrique | 8697 | 1086 | 1086 | 10869 |
Gujarati | gu | BBC gujarati | 9119 | 1139 | 1139 | 11397 |
Hausa | ha | BBC hausa | 6418 | 802 | 802 | 8022 |
Hindi | hi | BBC hindi | 70778 | 8847 | 8847 | 88472 |
Igbo | ig | BBC igbo | 4183 | 522 | 522 | 5227 |
Indonesian | id | BBC indonesia | 38242 | 4780 | 4780 | 47802 |
Japanese | ja | BBC japanese | 7113 | 889 | 889 | 8891 |
Kirundi | rn | BBC gahuza | 5746 | 718 | 718 | 7182 |
Korean | ko | BBC korean | 4407 | 550 | 550 | 5507 |
Kyrgyz | ky | BBC kyrgyz | 2266 | 500 | 500 | 3266 |
Marathi | mr | BBC marathi | 10903 | 1362 | 1362 | 13627 |
Nepali | ne | BBC nepali | 5808 | 725 | 725 | 7258 |
Oromo | om | BBC afaanoromoo | 6063 | 757 | 757 | 7577 |
Pashto | ps | BBC pashto | 14353 | 1794 | 1794 | 17941 |
Persian | fa | BBC persian | 47251 | 5906 | 5906 | 59063 |
Pidgin** | pcm | BBC pidgin | 9208 | 1151 | 1151 | 11510 |
Portuguese | pt | BBC portuguese | 57402 | 7175 | 7175 | 71752 |
Punjabi | pa | BBC punjabi | 8215 | 1026 | 1026 | 10267 |
Russian | ru | BBC russian, BBC ukrainian* | 62243 | 7780 | 7780 | 77803 |
Scottish Gaelic | gd | BBC naidheachdan | 1313 | 500 | 500 | 2313 |
Serbian (Cyrillic) | sr | BBC serbian/cyr | 7275 | 909 | 909 | 9093 |
Serbian (Latin) | sr | BBC serbian/lat | 7276 | 909 | 909 | 9094 |
Sinhala | si | BBC sinhala | 3249 | 500 | 500 | 4249 |
Somali | so | BBC somali | 5962 | 745 | 745 | 7452 |
Spanish | es | BBC mundo | 38110 | 4763 | 4763 | 47636 |
Swahili | sw | BBC swahili | 7898 | 987 | 987 | 9872 |
Tamil | ta | BBC tamil | 16222 | 2027 | 2027 | 20276 |
Telugu | te | BBC telugu | 10421 | 1302 | 1302 | 13025 |
Thai | th | BBC thai | 6616 | 826 | 826 | 8268 |
Tigrinya | ti | BBC tigrinya | 5451 | 681 | 681 | 6813 |
Turkish | tr | BBC turkce | 27176 | 3397 | 3397 | 33970 |
Ukrainian | uk | BBC ukrainian | 43201 | 5399 | 5399 | 53999 |
Urdu | ur | BBC urdu | 67665 | 8458 | 8458 | 84581 |
Uzbek | uz | BBC uzbek | 4728 | 590 | 590 | 5908 |
Vietnamese | vi | BBC vietnamese | 32111 | 4013 | 4013 | 40137 |
Welsh | cy | BBC cymrufyw | 9732 | 1216 | 1216 | 12164 |
Yoruba | yo | BBC yoruba | 6350 | 793 | 793 | 7936 |
* A lot of articles in BBC Sinhala and BBC Ukrainian were written in English and Russian respectively. They were identified using Fasttext and moved accordingly.
** West African Pidgin English
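As a quick check of the split ratios described above, the following sketch recomputes the train/dev/test shares for three rows of the table. The counts are copied from the table; the helper name `split_percentages` is only an illustrative choice and is not part of the dataset tooling.

```python
# Counts copied from the table above: (train, dev, test).
counts = {
    "Amharic": (5761, 719, 719),
    "English": (306522, 11535, 11535),
    "Kyrgyz": (2266, 500, 500),
}


def split_percentages(train, dev, test):
    """Return the train/dev/test shares of the total in percent, rounded to one decimal."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


for language, (train, dev, test) in counts.items():
    print(language, split_percentages(train, dev, test))
```

Amharic follows the default 80/10/10 split and English the 93/3.5/3.5 split, while Kyrgyz deviates from both because its evaluation sets were fixed at 500 samples.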
Traditional abstractive text summarization has been centered around English and other high-resource languages. XL-Sum provides a large collection of high-quality article-summary pairs for 45 languages where the languages range from high-resource to extremely low-resource. This enables the research community to explore the summarization capabilities of different models for multiple languages and languages in isolation. We believe the addition of XL-Sum to GEM makes the domain of abstractive text summarization more diversified and inclusive to the research community. We hope our efforts in this work will encourage the community to push the boundaries of abstractive text summarization beyond the English language, especially for low and mid-resource languages, bringing technological advances to communities of these languages that have been traditionally under-served.
Similar Datasets: yes
Unique Language Coverage: yes
Difference from other GEM datasets: The summaries are highly concise and abstractive.
Ability that the Dataset measures: Conciseness, abstractiveness, and overall summarization capability.
no
Additional Splits? no
Conciseness, abstractiveness, and overall summarization capability.
Metrics: ROUGE
Proposed Evaluation: ROUGE is the de facto evaluation metric for text summarization. However, it was designed specifically for evaluating English texts, and its scores depend heavily on text tokenization, stemming, removal of unnecessary characters, and similar preprocessing. Some modifications were therefore made to the original ROUGE evaluation, such as removing only punctuation and applying language-specific tokenization and stemming, to enable reliable comparison of source and target summaries across different scripts.
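To illustrate why tokenization matters here, the sketch below computes a unigram-overlap (ROUGE-1 style) F1 score after removing only Unicode punctuation, so letters and combining marks of non-Latin scripts are kept. It is a minimal illustration under stated assumptions, not the modified ROUGE implementation used for the dataset: the function names are hypothetical, there is no stemming, and whitespace splitting would have to be replaced by a language-specific segmenter for scripts written without spaces (for example Chinese, Japanese or Thai).

```python
import unicodedata
from collections import Counter


def tokenize(text):
    """Replace Unicode punctuation (category P*) with spaces, then split on whitespace."""
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return cleaned.lower().split()


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference summary and a candidate summary."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# The Devanagari danda is punctuation and is dropped; the words are untouched.
print(tokenize("भारत में चुनाव।"))
print(rouge1_f1("Yahoo files patent applications.", "yahoo files patent applications"))
```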
Previous results available? no
State-of-the-art text summarization models are heavily data-driven, i.e., a large number of article-summary pairs are required to train them effectively. As a result, abstractive summarization has centered around the English language, as most large abstractive summarization datasets are available in English only. Though there have been some recent efforts for curating multilingual abstractive summarization datasets, they are limited in terms of the number of languages covered, the number of training samples, or both. To this end, we curate XL-Sum, a large-scale abstractive summarization dataset of 1.35 million news articles from 45 languages crawled from the British Broadcasting Corporation website.
Communicative Goal: Introduce new languages into the English-centric domain of abstractive text summarization and enable both multilingual and per-language summarization.
Sourced from Different Sources: yes
Source Details: British Broadcasting Corporation (BBC) news websites.
Found
Where was it found? Multiple websites
Language Producers: The language content was written by professional news editors hired by BBC.
Topics Covered: News
Data Validation: not validated
Data Preprocessing: We used 'NFKC' normalization on all text instances.
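For readers unfamiliar with NFKC, the following sketch shows what this normalization does using Python's standard `unicodedata` module. The wrapper name `normalize_text` and the sample strings are illustrative assumptions and are not taken from the dataset's preprocessing code.

```python
import unicodedata


def normalize_text(text):
    """Apply Unicode NFKC normalization: compatibility decomposition, then canonical composition."""
    return unicodedata.normalize("NFKC", text)


# Illustrative inputs: a ligature, fullwidth Latin letters, and a superscript digit.
samples = ["\ufb01nance", "\uff22\uff22\uff23", "x\u00b2"]
for sample in samples:
    print(repr(sample), "->", repr(normalize_text(sample)))
```

NFKC folds visually similar compatibility characters into a single canonical form, which keeps token counts and n-gram overlaps consistent across articles typed with different input methods.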
Was Data Filtered? algorithmically
Filter Criteria: We designed a crawler to recursively crawl pages starting from the homepage by visiting different article links present in each page visited. We were able to take advantage of the fact that all BBC sites have somewhat similar structures, and were able to scrape articles from all sites. We discarded pages with no textual contents (mostly pages consisting of multimedia contents) before further processing. We designed a number of heuristics to make the extraction effective by carefully examining the HTML structures of the crawled pages:
none
Annotation Service? no
yes
Consent Policy Details: BBC's policy specifies that the text content within its websites can be used for non-commercial research only.
likely
Categories of PII: generic PII
Any PII Identification? no identification
no
no
yes
Details on how Dataset Addresses the Needs: This dataset introduces a summarization corpus for many languages for which no comparable dataset had been curated before.
no
Are the Language Producers Representative of the Language? Yes
research use only, non-commercial use only
Copyright Restrictions on the Language Data: research use only, non-commercial use only
Human evaluation showed that most languages had a high percentage of good summaries, in the upper nineties. Almost none of the summaries contained any conflicting information, while about one-third on average had information that was not directly inferable from the source article. Since multiple articles are generally written about an important event, there could be overlap between the training and evaluation data in terms of content.
Unsuited Applications: The dataset is limited to the news domain only. Hence, it would not be advisable to use a model trained on this dataset for summarizing texts from a different domain, e.g. literature or scientific text. Another pitfall could be hallucinations in the model-generated summary.
Discouraged Use Cases: ROUGE evaluates the quality of the summary as a whole by considering up to 4-gram overlaps. Therefore, in an article about India, if the word "India" in the generated summary gets replaced by "Pakistan" due to model hallucination, the overall score would not be reduced significantly, but the entire meaning could change.
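The India/Pakistan concern can be made concrete with a small n-gram overlap computation. This is a simplified sketch, assuming whitespace tokenization and a plain ROUGE-N style F1; the sentences are invented for illustration and the function names are hypothetical, so the numbers will not match those of an official ROUGE package.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n):
    """N-gram overlap F1 between two whitespace-tokenized strings."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up sentences: the candidate swaps a single entity.
reference = "india has announced a new budget for rural health care"
swapped = "pakistan has announced a new budget for rural health care"

print("ROUGE-1 F1:", rouge_n_f1(reference, swapped, 1))
print("ROUGE-2 F1:", rouge_n_f1(reference, swapped, 2))
```

Nine of the ten unigrams and eight of the nine bigrams still match, so both scores stay close to 0.9 even though the summary now describes the wrong country.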