Dataset:

GEM/mlsum

Task:

Summarization

Language:

Multilinguality:

unknown

Size:

size_categories:unknown

Language creators:

unknown

Annotation creators:

none

Source datasets:

original

License:

other

Dataset Card for GEM/mlsum

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

MLSum is a multilingual summarization dataset crawled from different news websites. The GEM version supports the German and Spanish subset alongside specifically collected challenge sets for COVID-related articles to test out-of-domain generalization.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/mlsum')

The data loader can be found here.

website

N/A

paper

ACL Anthology

authors

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano

Dataset Overview

Where to find the Data and its Documentation

Download

Gitlab

Paper

ACL Anthology

BibTex

@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = "Scialom, Thomas  and
      Dray, Paul-Alexis  and
      Lamprier, Sylvain  and
      Piwowarski, Benjamin  and
      Staiano, Jacopo",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.647",
    doi = "10.18653/v1/2020.emnlp-main.647",
    pages = "8051--8067",
    abstract = "We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages {--} namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.",
}

Contact Name

Thomas Scialom

Contact Email

{thomas,paul-alexis,jacopo}@recital.ai, {sylvain.lamprier,benjamin.piwowarski}@lip6.fr

Has a Leaderboard?

Languages and Intended Use

Multilingual?

yes

Covered Dialects

There is only one dialect per language: Hochdeutsch for German and Castilian Spanish for Spanish.

Covered Languages

German, Spanish, Castilian

Whose Language?

The German articles are crawled from Süddeutsche Zeitung and the Spanish ones from El País.

License

other: Other license

Intended Use

The intended use of this dataset is to augment existing datasets for English news summarization with additional languages.

Add. License Info

Restricted to non-commercial research purposes.

Primary Task

Summarization

Communicative Goal

The speaker is required to produce a high-quality summary of news articles in the same language as the input article.

Credit

Curation Organization Type(s)

other

Curation Organization(s)

CNRS, Sorbonne Université, reciTAL

Dataset Creators

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano

Funding

Funding information is not specified.

Who added the Dataset to GEM?

The original data card was written by Pedro Henrique Martins (Instituto de Telecomunicações); Sebastian Gehrmann (Google Research) extended and updated it to the v2 format. The COVID challenge set was created by Laura Perez-Beltrachini (University of Edinburgh). Data cleaning was done by Juan Diego Rodriguez (UT Austin).

Dataset Structure

Data Fields

The data fields are:

text: the source article (string).
summary: the output summary (string).
topic: the topic of the article (string).
url: the article's url (string).
title: the article's title (string).
date: the article's date (string).
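
As a rough illustration of the field list above, the following sketch checks that a record carries every documented field as a string. The helper name `missing_or_non_string_fields` and the sample record are made up for this example and are not part of the dataset or its loader.

```python
from typing import Any, Dict, List

# Field names taken from the list above; all are documented as strings.
EXPECTED_FIELDS = ("text", "summary", "topic", "url", "title", "date")


def missing_or_non_string_fields(record: Dict[str, Any]) -> List[str]:
    """Return the documented fields that are absent or not a string."""
    return [
        name
        for name in EXPECTED_FIELDS
        if not isinstance(record.get(name), str)
    ]


# Hypothetical record, invented for illustration only.
example = {
    "text": "Ein kurzer Beispielartikel.",
    "summary": "Ein Beispiel.",
    "topic": "politik",
    "url": "https://example.org/artikel",
    "title": "Beispiel",
    "date": "01/01/2010",
}
problems = missing_or_non_string_fields(example)
print(problems)
```

Note that the example instance in this card stores the summary under a `target` key and adds `gem_id`, `gem_parent_id` and `references`, so a check like this may need to be adapted to the schema that is actually loaded.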

Reason for Structure

The structure follows previously released datasets. The topic and title fields were added to enable additional tasks like title generation and topic detection.

How were labels chosen?

They are human-written highlights or summaries scraped from the same website.

Example Instance

{
'date': '00/01/2010',
'gem_id': 'mlsum_de-train-2',
'gem_parent_id': 'mlsum_de-train-2',
'references': [],
'target': 'Oskar Lafontaine gibt den Parteivorsitz der Linken ab - und seine Kollegen streiten, wer ihn beerben soll. sueddeutsche.de stellt die derzeit aussichtsreichsten Anwärter für Führungsaufgaben vor. Mit Vote.',
'text': 'Wenn an diesem Montag die Landesvorsitzenden der Linken über die Nachfolger der derzeitigen Chefs Lothar Bisky und Oskar Lafontaine sowie des Bundesgeschäftsführers Dietmar Bartsch beraten, geht es nicht nur darum, wer die Partei führen soll. Es geht auch um die künftige Ausrichtung und Stärke einer Partei, die vor allem von Lafontaine zusammengehalten worden war. Ihm war es schließlich vor fünf Jahren gelungen, aus der ostdeutschen PDS und der westedeutschen WASG eine Partei zu formen. Eine Partei allerdings, die zerrissen ist in Ost und West, in Regierungswillige und ewige Oppositionelle, in Realos und Ideologen, in gemäßigte und radikale Linke. Wir stellen mögliche Kandidaten vor. Stimmen Sie ab: Wen halten Sie für geeignet und wen für unfähig? Kampf um Lafontaines Erbe: Gregor Gysi Sollte überhaupt jemand die Partei alleine führen, wie es sich viele Ostdeutsche wünschen, käme dafür wohl nur der 62-jährige Gregor Gysi in Betracht. Er ist nach Lafontaine einer der bekanntesten Politiker der Linken und derzeit Fraktionsvorsitzender der Partei im Bundestag. Allerdings ist der ehemalige PDS-Vorsitzende und Rechtsanwalt nach drei Herzinfarkten gesundheitlich angeschlagen. Wahrscheinlich wäre deshalb, dass er die zerstrittene Partei nur übergangsweise führt. Doch noch ist nicht klar, ob eine Person allein die Partei führen soll oder eine Doppelspitze. Viele Linke wünschen sich ein Duo aus einem westdeutschen und einem ostdeutschen Politiker, Mann und Frau. Foto: Getty Images',
'title': 'Personaldebatte bei der Linken - Wer kommt nach Lafontaine?',
'topic': 'politik',
'url': 'https://www.sueddeutsche.de/politik/personaldebatte-bei-der-linken-wer-kommt-nach-lafontaine-1.70041'
}

Data Splits

The statistics of the original dataset are:

| | Dataset | Train | Validation | Test | Mean article length | Mean summary length |
| :--- | :----: | :---: | :---: | :---: | :---: | :---: |
| German | 242,982 | 220,887 | 11,394 | 10,701 | 570.6 (words) | 30.36 (words) |
| Spanish | 290,645 | 266,367 | 10,358 | 13,920 | 800.5 (words) | 20.71 (words) |
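
To make the relative split sizes easier to read, here is a small sketch that derives each split's share from the counts in the table of original statistics. The numbers are copied from that table; the helper `split_shares` is a name chosen for this example.

```python
from typing import Dict

# Split sizes copied from the table of original statistics above.
ORIGINAL_SPLITS = {
    "German": {"train": 220_887, "validation": 11_394, "test": 10_701},
    "Spanish": {"train": 266_367, "validation": 10_358, "test": 13_920},
}


def split_shares(splits: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the language's total size."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


for language, splits in ORIGINAL_SPLITS.items():
    total = sum(splits.values())
    shares = split_shares(splits)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{language}: {total:,} examples ({summary})")
```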

The statistics of the cleaned version of the dataset are:

| | Dataset | Train | Validation | Test |
| :--- | :----: | :---: | :---: | :---: |
| German | 242,835 | 220,887 | 11,392 | 10,695 |
| Spanish | 283,228 | 259,886 | 9,977 | 13,365 |

The COVID challenge sets have 5058 (de) and 1938 (es) examples.

Splitting Criteria

The training set contains data from 2010 to 2018. Data from 2019 (~10% of the dataset) is used for validation (up to May) and testing (May-December 2019).
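
The time-based criteria above could be sketched roughly as follows. This is not the original splitting script. It assumes dates are day/month/year strings as in the example instance, and, because the text leaves the May boundary ambiguous, it assumes that May itself falls into the test period. The function name `assign_split` is hypothetical.

```python
def assign_split(date: str) -> str:
    """Map a 'DD/MM/YYYY' date string to 'train', 'validation' or 'test'."""
    # Only month and year are used; the day is parsed but ignored, so
    # placeholder days such as '00' do not cause problems.
    _day, month, year = (int(part) for part in date.split("/"))
    if 2010 <= year <= 2018:
        return "train"
    if year == 2019:
        # Assumption: January-April is validation, May-December is test.
        return "validation" if month < 5 else "test"
    raise ValueError(f"date outside the 2010-2019 range: {date!r}")


print(assign_split("00/01/2010"))
print(assign_split("15/04/2019"))
print(assign_split("01/05/2019"))
```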

Some topics are less represented within the dataset (e.g., Financial news in German and Television in Spanish).

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

As the first large-scale multilingual summarization dataset, it enables evaluation of summarization models beyond English.

Similar Datasets

yes

Unique Language Coverage

yes

Difference from other GEM datasets

In our configuration, the dataset is fully non-English.

Ability that the Dataset measures

Content Selection, Content Planning, Realization

GEM-Specific Curation

Modified for GEM?

yes

GEM Modifications

data points removed, data points added

Modification Details

The modifications done to the original dataset are the following:

Selection of 2 languages (Spanish and German) out of the dataset's 5 languages due to copyright restrictions.
Removal of duplicate articles.
Manual removal of article-summary pairs for which the summary is not related to the article.
Removal of article-summary pairs written in a different language (detected using the langdetect library).
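
The two automatic steps in the list above (duplicate removal and language filtering) might look roughly like the sketch below. It is an illustration under stated assumptions, not the cleaning script that was actually used: duplicates are assumed to be exact matches on the article text, both article and summary are assumed to be language-checked, and the language detector is passed in as a callable so the sketch does not depend on any particular library. The manual removal of unrelated summaries cannot be reproduced in code and is left out.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def clean_records(
    records: Iterable[Record],
    expected_language: str,
    detect_language: Callable[[str], str],
) -> List[Record]:
    """Drop duplicate articles and pairs not written in the expected language."""
    seen_texts = set()
    kept = []
    for record in records:
        text = record["text"]
        # Assumption: a duplicate is an article with exactly the same text.
        if text in seen_texts:
            continue
        seen_texts.add(text)
        # Assumption: both the article and its summary must match the language.
        if detect_language(text) != expected_language:
            continue
        if detect_language(record["summary"]) != expected_language:
            continue
        kept.append(record)
    return kept
```

With the langdetect library installed, its `detect` function (which returns codes such as `de` or `es`) could be passed as `detect_language`, for example `clean_records(records, "de", detect)`.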

Additional Splits?

yes

Split Information

For both selected languages (German and Spanish), we compiled time-shifted test data in the form of new articles from the second semester of 2020 with COVID-19-related keywords. We collected articles from the same German and Spanish outlets as the original MLSUM datasets (Süddeutsche Zeitung and El País). We used the scripts provided for the re-creation of the MLSUM datasets. The new challenge test set for German contains 5058 instances and the Spanish one contains 1938.
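
A keyword-and-date filter of the kind described above could be sketched as follows. The keyword list is purely hypothetical, since the card does not state which keywords were used, and the date format is assumed to be day/month/year as in the example instance. "Second semester of 2020" is read here as July through December 2020.

```python
from typing import Dict

# Hypothetical keyword list; the source does not say which keywords were used.
COVID_KEYWORDS = ("covid", "corona", "coronavirus", "pandemie", "pandemia")


def is_covid_challenge_candidate(record: Dict[str, str]) -> bool:
    """True for articles from July-December 2020 that mention a keyword."""
    _day, month, year = (int(part) for part in record["date"].split("/"))
    in_period = year == 2020 and month >= 7
    text = record["text"].lower()
    return in_period and any(keyword in text for keyword in COVID_KEYWORDS)
```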

We also sample 500 training and validation points as additional challenge sets to measure overfitting.
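
Drawing such a fixed-size subset can be sketched as below. The card does not say how the points were sampled or whether 500 applies per split, so the seeded uniform sampling and the function name `sample_challenge_subset` are assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_challenge_subset(
    examples: Sequence[T], size: int = 500, seed: int = 0
) -> List[T]:
    """Draw a reproducible uniform sample without replacement."""
    if len(examples) <= size:
        # Nothing to sample from: return everything that is available.
        return list(examples)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(examples, size)


subset = sample_challenge_subset(list(range(10_000)))
print(len(subset))
```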

Split Motivation

Generalization to unseen topics.

Getting Started with the Task

Previous Results

Measured Model Abilities

Content Selection, Content Planning, Realization

Metrics

METEOR, ROUGE, Other: Other Metrics

Other Metrics

Novelty: Number of generated n-grams not included in the source articles.
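
A minimal sketch of this novelty count is shown below. It assumes lowercased whitespace tokenization and counts every occurrence of a generated n-gram that is absent from the source; the exact tokenization and counting used in reported results may differ. The function names are chosen for this example.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_count(source: str, generated: str, n: int = 1) -> int:
    """Count generated n-grams that never occur in the source article."""
    # Assumption: simple lowercased whitespace tokenization.
    source_ngrams = set(ngrams(source.lower().split(), n))
    generated_ngrams = ngrams(generated.lower().split(), n)
    return sum(1 for gram in generated_ngrams if gram not in source_ngrams)


source = "the cat sat on the mat"
generated = "the cat slept on a mat"
print(novel_ngram_count(source, generated, n=1))  # 2 ("slept", "a")
print(novel_ngram_count(source, generated, n=2))  # 4
```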

Proposed Evaluation

ROUGE and METEOR both measure n-gram overlap with a focus on recall and are standard summarization metrics. Novelty is often reported alongside them to characterize how much a model diverges from its inputs.
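
To illustrate what recall-oriented n-gram overlap means, here is a deliberately simplified ROUGE-N recall sketch. It is not the official ROUGE implementation (which adds its own tokenization, optional stemming and further variants such as ROUGE-L), and it does not cover METEOR's alignment step. It assumes lowercased whitespace tokenization.

```python
from collections import Counter
from typing import Sequence, Tuple


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count contiguous n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: each reference n-gram is matched at most as often
    # as it occurs in the candidate.
    overlap = sum((ref & cand).values())
    return overlap / total


print(rouge_n_recall("the cat sat on the mat", "the cat sat"))  # 0.5
```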

Previous results available?

yes

Other Evaluation Approaches

The GEM benchmark results ( https://gem-benchmark.com/results ) report a wide range of metrics, including lexical overlap metrics as well as semantic ones like BLEURT and BERT-Score.

Dataset Curation

Original Curation

Original Curation Rationale

The rationale was to create a multilingual news summarization dataset that mirrors the format of popular English datasets like XSum or CNN/DM.

Communicative Goal

The speaker is required to produce a high-quality summary of news articles in the same language as the input article.

Sourced from Different Sources

yes

Source Details

www.lemonde.fr
www.sueddeutsche.de
www.elpais.com
www.mk.ru
www.internethaber.com

Language Data

How was Language Data Obtained?

Found

Where was it found?

Multiple websites

Language Producers

The language producers are professional journalists.

Topics Covered

Four of the five original languages report their topics (all except Turkish), and the distributions differ between sources. The dominant topics in German are Politik, Sport, and Wirtschaft (economy). The dominant topics in Spanish are actualidad (current news) and opinion. French and Russian differ as well, but we omit these languages in the GEM version.

Data Validation

not validated

Was Data Filtered?

algorithmically

Filter Criteria

In the original dataset, only one filter was applied: all articles shorter than 50 words or with summaries shorter than 10 words were discarded.

The GEM version additionally applies a langID filter to ensure that articles are in the correct language.
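
The length filter of the original dataset can be sketched as follows, assuming that words are counted by whitespace splitting (the card does not specify the tokenization). The thresholds come from the description above; the function name is hypothetical.

```python
def passes_length_filter(
    article: str,
    summary: str,
    min_article_words: int = 50,
    min_summary_words: int = 10,
) -> bool:
    """Keep a pair only if neither article nor summary is too short."""
    # Assumption: words are counted by splitting on whitespace.
    article_ok = len(article.split()) >= min_article_words
    summary_ok = len(summary.split()) >= min_summary_words
    return article_ok and summary_ok


print(passes_length_filter("wort " * 50, "wort " * 10))
print(passes_length_filter("wort " * 49, "wort " * 10))
```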

Structured Annotations

Additional Annotations?

none

Annotation Service?

Consent

Any Consent Policy?

Justification for Using the Data

The copyright remains with the original data creators and the usage permission is restricted to non-commercial uses.

Private Identifying Information (PII)

Contains PII?

yes/very likely

Categories of PII

sensitive information, generic PII

Any PII Identification?

no identification

Maintenance

Any Maintenance Plan?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Discussion of Biases

Any Documented Social Biases?

Author:

GEM

Dataset size:

48.81 KB