Dataset:

GEM/OrangeSum

Task:

Summarization

Language:

Multilinguality:

unknown

Size:

unknown

Language creators:

unknown

Annotation creators:

unknown

Source datasets:

original

License:

other


Dataset Card for GEM/OrangeSum

Link to Main Data Card

You can find the main data card on the GEM website.

Dataset Summary

OrangeSum is a French summarization dataset inspired by XSum. It features two subtasks: abstract generation and title generation. The data was sourced from "Orange Actu" articles between 2011 and 2020.

You can load the dataset via:

import datasets

# Fetches GEM/OrangeSum from the Hugging Face Hub (cached locally after the first call)
data = datasets.load_dataset('GEM/OrangeSum')

The data loader can be found here.

paper

ACL Anthology

Dataset Overview

Where to find the Data and its Documentation

Download

Github

Paper

ACL Anthology

BibTex

@inproceedings{kamal-eddine-etal-2021-barthez,
    title = "{BART}hez: a Skilled Pretrained {F}rench Sequence-to-Sequence Model",
    author = "Kamal Eddine, Moussa  and
      Tixier, Antoine  and
      Vazirgiannis, Michalis",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.740",
    doi = "10.18653/v1/2021.emnlp-main.740",
    pages = "9369--9390",
    abstract = "Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez{'} corpus, and show our resulting model, mBARThez, to significantly boost BARThez{'} generative performance.",
}

Has a Leaderboard?

Languages and Intended Use

Multilingual?

Covered Languages

French

License

other: Other license

Primary Task

Summarization

Credit

Dataset Structure

Dataset in GEM

Rationale for Inclusion in GEM

Similar Datasets

GEM-Specific Curation

Modified for GEM?

Additional Splits?

Getting Started with the Task

Pointers to Resources

Papers about abstractive summarization using seq2seq models:

Papers about (pretrained) Transformers:

Technical Terms

No unique technical words in this data card.

Previous Results

Measured Model Abilities

The ability of the model to generate human-like titles and abstracts for given news articles.

Metrics

ROUGE, BERTScore

Proposed Evaluation

Automatic evaluation: ROUGE-1, ROUGE-2, ROUGE-L and BERTScore were used.
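To make the ROUGE metrics listed above concrete, here is a minimal sketch of ROUGE-N and ROUGE-L F1 scores in plain Python. It assumes lowercased whitespace tokenization and a single reference, and it is not the scorer used by the dataset authors; the function names (`rouge_n`, `rouge_l`) and the example sentences are illustrative only. Published numbers should come from an established ROUGE implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1 using clipped n-gram overlap (simplified tokenization)."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return f1(lcs_length(ref, cand), len(cand), len(ref))


# Hypothetical French example, not taken from the dataset
reference = "le chat dort sur le canapé"
candidate = "le chat dort"
print(rouge_n(reference, candidate, 1))
print(rouge_n(reference, candidate, 2))
print(rouge_l(reference, candidate))
```

The candidate here is a prefix of the reference, so precision is perfect and the scores are limited only by recall.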

Human evaluation: a human evaluation study was conducted with 11 native French speakers. The evaluators were PhD students from the computer science department of the authors' university, working in NLP and other fields of AI. They volunteered after receiving an email announcement. Best-Worst Scaling (Louviere et al., 2015) was used: two summaries from two different systems, along with their input document, were presented to an evaluator, who had to decide which one was better. The evaluators were asked to base their judgments on accuracy (does the summary contain accurate facts?), informativeness (is important information captured?) and fluency (is the summary written in well-formed French?).
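The pairwise judgments described above can be aggregated into one score per system. The sketch below assumes the common Best-Worst Scaling convention in which a system's score is the fraction of comparisons where it was preferred minus the fraction where it was not, giving a value between -1 and 1. The data card does not spell out the aggregation, so this formula, the function name `best_worst_scores`, and the system names are all assumptions for illustration.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Aggregate pairwise preferences into per-system scores in [-1, 1].

    `judgments` is an iterable of (preferred_system, other_system) pairs,
    one pair per comparison made by an evaluator.
    """
    best = Counter()
    worst = Counter()
    for preferred, other in judgments:
        best[preferred] += 1
        worst[other] += 1

    scores = {}
    for system in set(best) | set(worst):
        # In a pairwise setup every appearance is either a win or a loss
        appearances = best[system] + worst[system]
        scores[system] = (best[system] - worst[system]) / appearances
    return scores


# Hypothetical judgments over three made-up systems
judgments = [
    ("system_a", "system_b"),
    ("system_a", "system_b"),
    ("system_b", "system_a"),
    ("system_a", "system_c"),
]
for name, score in sorted(best_worst_scores(judgments).items()):
    print(name, round(score, 3))
```

A score near 1 means the system was almost always preferred, a score near -1 means it almost never was, and 0 means it won and lost equally often.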

Previous results available?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Discussion of Biases

Any Documented Social Biases?

Are the Language Producers Representative of the Language?

The dataset contains news articles written by professional authors.

Considerations for Using the Data

PII Risks and Liability

Licenses

open license - commercial use allowed

Known Technical Limitations

Author:

GEM

Dataset size:

864.03 KB