You can find the main data card on the GEM Website.
MLSum is a multilingual summarization dataset crawled from different news websites. The GEM version supports the German and Spanish subsets alongside specifically collected challenge sets for COVID-related articles to test out-of-domain generalization.
You can load the dataset via:
    import datasets
    data = datasets.load_dataset('GEM/mlsum')

The data loader can be found here.
Website: N/A
Paper authors: Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
    @inproceedings{scialom-etal-2020-mlsum,
        title = "{MLSUM}: The Multilingual Summarization Corpus",
        author = "Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo",
        booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
        month = nov,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2020.emnlp-main.647",
        doi = "10.18653/v1/2020.emnlp-main.647",
        pages = "8051--8067",
        abstract = "We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages {--} namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.",
    }

Contact Name: Thomas Scialom
Contact Email: {thomas,paul-alexis,jacopo}@recital.ai, {sylvain.lamprier,benjamin.piwowarski}@lip6.fr
Has a Leaderboard? no
yes
Covered Dialects: There is only one dialect per language: Hochdeutsch for German and Castilian Spanish for Spanish.
Covered Languages: German, Spanish (Castilian)
Whose Language? The German articles are crawled from Süddeutsche Zeitung and the Spanish ones from El Pais.
License: Other license
Intended Use: The intended use of this dataset is to augment existing datasets for English news summarization with additional languages.
Add. License Info: Restricted to non-commercial research purposes.
Primary Task: Summarization
Communicative Goal: The speaker is required to produce a high quality summary of news articles in the same language as the input article.
other
Curation Organization(s): CNRS, Sorbonne Université, reciTAL
Dataset Creators: Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
Funding: Funding information is not specified.
Who added the Dataset to GEM? The original data card was written by Pedro Henrique Martins (Instituto de Telecomunicações); Sebastian Gehrmann (Google Research) extended and updated it to the v2 format. The COVID challenge set was created by Laura Perez-Beltrachini (University of Edinburgh). Data cleaning was done by Juan Diego Rodriguez (UT Austin).
The data fields are:
The structure follows previously released datasets. The topic and title fields were added to enable additional tasks like title generation and topic detection.
How were labels chosen? They are human-written highlights or summaries scraped from the same website.
Example Instance:

    {
        'date': '00/01/2010',
        'gem_id': 'mlsum_de-train-2',
        'gem_parent_id': 'mlsum_de-train-2',
        'references': [],
        'target': 'Oskar Lafontaine gibt den Parteivorsitz der Linken ab - und seine Kollegen streiten, wer ihn beerben soll. sueddeutsche.de stellt die derzeit aussichtsreichsten Anwärter für Führungsaufgaben vor. Mit Vote.',
        'text': 'Wenn an diesem Montag die Landesvorsitzenden der Linken über die Nachfolger der derzeitigen Chefs Lothar Bisky und Oskar Lafontaine sowie des Bundesgeschäftsführers Dietmar Bartsch beraten, geht es nicht nur darum, wer die Partei führen soll. Es geht auch um die künftige Ausrichtung und Stärke einer Partei, die vor allem von Lafontaine zusammengehalten worden war. Ihm war es schließlich vor fünf Jahren gelungen, aus der ostdeutschen PDS und der westedeutschen WASG eine Partei zu formen. Eine Partei allerdings, die zerrissen ist in Ost und West, in Regierungswillige und ewige Oppositionelle, in Realos und Ideologen, in gemäßigte und radikale Linke. Wir stellen mögliche Kandidaten vor. Stimmen Sie ab: Wen halten Sie für geeignet und wen für unfähig? Kampf um Lafontaines Erbe: Gregor Gysi Sollte überhaupt jemand die Partei alleine führen, wie es sich viele Ostdeutsche wünschen, käme dafür wohl nur der 62-jährige Gregor Gysi in Betracht. Er ist nach Lafontaine einer der bekanntesten Politiker der Linken und derzeit Fraktionsvorsitzender der Partei im Bundestag. Allerdings ist der ehemalige PDS-Vorsitzende und Rechtsanwalt nach drei Herzinfarkten gesundheitlich angeschlagen. Wahrscheinlich wäre deshalb, dass er die zerstrittene Partei nur übergangsweise führt. Doch noch ist nicht klar, ob eine Person allein die Partei führen soll oder eine Doppelspitze. Viele Linke wünschen sich ein Duo aus einem westdeutschen und einem ostdeutschen Politiker, Mann und Frau. \nFoto: Getty Images',
        'title': 'Personaldebatte bei der Linken - Wer kommt nach Lafontaine?',
        'topic': 'politik',
        'url': 'https://www.sueddeutsche.de/politik/personaldebatte-bei-der-linken-wer-kommt-nach-lafontaine-1.70041'
    }

Data Splits
The statistics of the original dataset are:
| | Total | Train | Validation | Test | Mean article length (words) | Mean summary length (words) |
| :--- | :----: | :---: | :---: | :---: | :---: | :---: |
| German | 242,982 | 220,887 | 11,394 | 10,701 | 570.6 | 30.36 |
| Spanish | 290,645 | 266,367 | 10,358 | 13,920 | 800.5 | 20.71 |
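
As a quick consistency check on the table above, the following sketch confirms that the three splits add up to the reported dataset size and computes the share of examples held out for validation and testing. The counts are copied from the table; the names `ORIGINAL_SPLITS` and `held_out_share` are illustrative and not part of the dataset or its loader.

```python
# Split sizes copied from the statistics table of the original dataset.
ORIGINAL_SPLITS = {
    "German": {"total": 242_982, "train": 220_887, "validation": 11_394, "test": 10_701},
    "Spanish": {"total": 290_645, "train": 266_367, "validation": 10_358, "test": 13_920},
}


def held_out_share(counts):
    """Return the fraction of examples that fall in validation plus test."""
    held_out = counts["validation"] + counts["test"]
    return held_out / counts["total"]


for language, counts in ORIGINAL_SPLITS.items():
    # The three splits should add up to the reported dataset size.
    split_sum = counts["train"] + counts["validation"] + counts["test"]
    assert split_sum == counts["total"], language
    print(f"{language}: {held_out_share(counts):.1%} held out")
```

Both shares come out somewhat below one tenth of the data.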
The statistics of the cleaned version of the dataset are:
| | Total | Train | Validation | Test |
| :--- | :----: | :---: | :---: | :---: |
| German | 242,835 | 220,887 | 11,392 | 10,695 |
| Spanish | 283,228 | 259,886 | 9,977 | 13,365 |
The COVID challenge sets have 5058 (de) and 1938 (es) examples.
Splitting Criteria: The training set contains data from 2010 to 2018. Data from 2019 (~10% of the dataset) is used for validation (up to May) and testing (May-December 2019).
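
The date-based criteria above can be sketched as a small function. This is an illustration under stated assumptions, not the original split script: the name `assign_split` is hypothetical, the `DD/MM/YYYY` layout is inferred from the example instance (`'00/01/2010'`), and because May is listed for both validation and testing, the sketch assumes that May articles go to the test set.

```python
def assign_split(date: str) -> str:
    """Map a 'DD/MM/YYYY' date string to 'train', 'validation', or 'test'."""
    # The day may be '00' in the data, so parse by hand instead of using datetime.
    _day, month, year = (int(part) for part in date.split("/"))
    if 2010 <= year <= 2018:
        return "train"
    if year == 2019:
        # Assumption: validation ends before May, and May itself is test data.
        return "validation" if month < 5 else "test"
    raise ValueError(f"date outside the 2010-2019 range: {date}")


print(assign_split("00/01/2010"))
print(assign_split("00/03/2019"))
print(assign_split("00/11/2019"))
```

Dates outside 2010-2019, such as those in the COVID challenge sets, raise an error instead of being silently assigned to a split.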
Some topics are less represented within the dataset (e.g., Financial news in German and Television in Spanish).
As the first large-scale multilingual summarization dataset, it enables evaluation of summarization models beyond English.
Similar Datasets: yes
Unique Language Coverage: yes
Difference from other GEM datasets: In our configuration, the dataset is fully non-English.
Ability that the Dataset measures: Content Selection, Content Planning, Realization
yes
GEM Modifications: data points removed, data points added
Modification Details: The modifications done to the original dataset are the following:
yes
Split Information: For both selected languages (German and Spanish), we compiled time-shifted test data in the form of new articles from the second half of 2020 containing COVID-19-related keywords. We collected articles from the same German and Spanish outlets as the original MLSUM datasets (El Pais and Süddeutsche Zeitung). We used the scripts provided for the re-creation of the MLSUM datasets. The new challenge test set for German contains 5058 instances and the Spanish one contains 1938.
We additionally sample 500 training and validation points as additional challenge sets to measure overfitting.
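
A minimal sketch of the two challenge-set steps described above: keyword filtering of new articles and drawing a fixed-size random subset. Everything here is an assumption made for illustration. The keyword list is hypothetical (the actual keywords are not given in this card), as are the function names, the fixed seed, and the choice of sampling without replacement.

```python
import random

# Hypothetical keyword list; the keywords actually used are not specified here.
COVID_KEYWORDS = ("covid", "coronavirus")


def is_covid_article(text, keywords=COVID_KEYWORDS):
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def sample_challenge_subset(examples, size=500, seed=0):
    """Draw a reproducible random subset of at most `size` examples."""
    examples = list(examples)
    if len(examples) <= size:
        return examples
    # A dedicated generator keeps the draw independent of global random state.
    return random.Random(seed).sample(examples, size)


subset = sample_challenge_subset(range(10_000))
print(len(subset))
```

Fixing the seed makes the subset reproducible across runs, which matters when the subset is used to compare models for overfitting.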
Split Motivation: Generalization to unseen topics.
Content Selection, Content Planning, Realization
Metrics: METEOR, ROUGE, Other Metrics
Other Metrics: Novelty: Number of generated n-grams not included in the source articles.
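
The novelty definition above can be illustrated with a short sketch. It assumes lowercased whitespace tokenization, which may differ from the tokenization used in the reported results, and the function names are illustrative.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_count(source, summary, n=1):
    """Count the n-grams of the summary that never occur in the source."""
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    return sum(1 for gram in summary_ngrams if gram not in source_ngrams)


source = "the cat sat on the mat"
summary = "the cat slept"
print(novel_ngram_count(source, summary, n=1))
print(novel_ngram_count(source, summary, n=2))
```

Dividing the count by the total number of summary n-grams would give a novelty ratio, which is easier to compare across summaries of different lengths.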
Proposed Evaluation: ROUGE and METEOR both measure n-gram overlap with a focus on recall and are standard summarization metrics. Novelty is often reported alongside them to characterize how much a model diverges from its inputs.
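
To make the notion of recall-oriented n-gram overlap concrete, here is a simplified sketch of ROUGE-N recall. It is not the official ROUGE implementation: it uses lowercased whitespace tokens, applies no stemming, and the function name is illustrative.

```python
from collections import Counter


def rouge_n_recall(reference, candidate, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""

    def count_ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    reference_counts = count_ngrams(reference)
    candidate_counts = count_ngrams(candidate)
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated n-gram is not credited more often than it occurs.
    overlap = sum(min(count, candidate_counts[gram])
                  for gram, count in reference_counts.items())
    return overlap / total


print(rouge_n_recall("the cat sat on the mat", "the cat sat", n=1))
```

In this example three of the six reference unigrams are matched, so the recall is 0.5.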
Previous results available? yes
Other Evaluation Approaches: The GEM benchmark results (https://gem-benchmark.com/results) report a wide range of metrics, including lexical overlap metrics as well as semantic ones like BLEURT and BERT-Score.
The rationale was to create a multilingual news summarization dataset that mirrors the format of popular English datasets like XSum or CNN/DM.
Communicative Goal: The speaker is required to produce a high quality summary of news articles in the same language as the input article.
Sourced from Different Sources: yes
Source Details: www.lemonde.fr, www.sueddeutsche.de, www.elpais.com, www.mk.ru, www.internethaber.com
Found
Where was it found? Multiple websites
Language Producers: The language producers are professional journalists.
Topics Covered: Four of the five original languages report their topics (all except Turkish), and the distributions differ between sources. The dominant topics in German are Politik, Sport, and Wirtschaft (economy). The dominant topics in Spanish are actualidad (current news) and opinion. French and Russian differ as well, but we omit these languages in the GEM version.
Data Validation: not validated
Was Data Filtered? algorithmically
Filter Criteria: In the original dataset, only one filter was applied: all articles shorter than 50 words and all summaries shorter than 10 words are discarded.
The GEM version additionally applies a langID filter to ensure that articles are in the correct language.
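
The length filter described above can be sketched as follows. Whitespace splitting is an assumption about how words were counted, and the names are illustrative. The langID filter is not shown, since the language identification tool used is not named here.

```python
MIN_ARTICLE_WORDS = 50
MIN_SUMMARY_WORDS = 10


def keep_pair(article, summary):
    """Keep a pair only if both texts reach the minimum word counts."""
    return (len(article.split()) >= MIN_ARTICLE_WORDS
            and len(summary.split()) >= MIN_SUMMARY_WORDS)


pairs = [
    ("wort " * 60, "kurz " * 12),  # long enough on both sides: kept
    ("wort " * 40, "kurz " * 12),  # article too short: discarded
    ("wort " * 60, "kurz " * 5),   # summary too short: discarded
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
print(len(filtered))
```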
none
Annotation Service? no
no
Justification for Using the Data: The copyright remains with the original data creators and the usage permission is restricted to non-commercial uses.
yes/very likely
Categories of PII: sensitive information, generic PII
Any PII Identification? no identification
no
no
no
no