Datasets:

projecte-aina/casum

Languages:

ca

Multilinguality:

monolingual

Language creators:

expert-generated

Annotations creators:

machine-generated

ArXiv:

arxiv:2202.06871

Dataset Card for CaSum

Dataset Summary

CaSum is a summarization dataset extracted from a newswire corpus crawled from the Catalan News Agency (Agència Catalana de Notícies; ACN). The corpus consists of 217,735 instances, each composed of a headline and a body.

Supported Tasks and Leaderboards

The dataset can be used to train a model for abstractive summarization. Success on this task is typically measured by achieving a high ROUGE score. The mbart-base-ca-casum model currently achieves a ROUGE score of 41.39.
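To make the metric concrete, here is a minimal sketch of a ROUGE-1 style F1 score based on whitespace-token overlap. It is a simplification for illustration only, not the scorer used to obtain the figure above; this card does not state which ROUGE variant that figure refers to, and the candidate summary below is made up.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Reference headline from the sample instance; the candidate is a hypothetical model output.
reference = "Mapfre preveu ingressar 31.000 milions d'euros al tancament de 2018"
candidate = "Mapfre preveu ingressos de 31.000 milions d'euros el 2018"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.4f}")
```

Published ROUGE implementations usually add stemming and language-aware tokenization, and scores are often reported on a 0-100 scale, so values from this sketch are not directly comparable with reported results.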

Languages

The dataset is in Catalan (ca-CA).

Dataset Structure

Data Instances

{
  "summary": "Mapfre preveu ingressar 31.000 milions d’euros al tancament de 2018",
  "text": "L’asseguradora llançarà la seva filial Verti al mercat dels EUA a partir de 2017 ACN Madrid.-Mapfre preveu assolir uns ingressos de 31.000 milions d'euros al tancament de 2018 i destinarà a retribuir els seus accionistes com a mínim el 50% dels beneficis del grup durant el període 2016-2018, amb una rendibilitat mitjana a l’entorn del 5%, segons ha anunciat la companyia asseguradora durant la celebració aquest divendres de la seva junta general d’accionistes. La firma asseguradora també ha avançat que llançarà la seva filial d’automoció i llar al mercat dels EUA a partir de 2017. Mapfre ha recordat durant la junta que va pagar més de 540 milions d'euros en impostos el 2015, amb una taxa impositiva efectiva del 30,4 per cent. La companyia també ha posat en marxa el Pla de Sostenibilitat 2016-2018 i el Pla de Transparència Activa, “que han de contribuir a afermar la visió de Mapfre com a asseguradora global de confiança”, segons ha informat en un comunicat."
}

Data Fields

  • summary (str): Summary of the piece of news
  • text (str): The text of the piece of news
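As a rough sketch, the two fields can be checked and the dataset loaded as follows. The helper names `is_valid_instance` and `load_casum` are hypothetical; the loader assumes the Hugging Face `datasets` library is installed, that the Hub is reachable, and that the repository loads with the installed library version.

```python
def is_valid_instance(record: dict) -> bool:
    """Check that a record carries the two string fields described above."""
    return all(isinstance(record.get(key), str) for key in ("summary", "text"))


def load_casum():
    """Load CaSum from the Hugging Face Hub (requires `datasets` and network access)."""
    # Imported lazily so that is_valid_instance has no third-party dependency.
    from datasets import load_dataset

    return load_dataset("projecte-aina/casum")


# Toy record with the same schema as a CaSum instance (placeholder values).
example = {"summary": "headline goes here", "text": "body of the news piece goes here"}
print(is_valid_instance(example))
```

Calling `load_casum()` should return a mapping from split names to datasets, from which individual records can be passed to `is_valid_instance`.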

Data Splits

We split the dataset into train, validation, and test splits:

  • train: 197,735 examples
  • validation: 10,000 examples
  • test: 10,000 examples
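A quick arithmetic check, using only the figures reported in this card, confirms that the three splits add up to the 217,735 instances of the full corpus and shows the share of each split.

```python
# Split sizes as reported in this card.
split_sizes = {"train": 197_735, "validation": 10_000, "test": 10_000}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(f"total: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```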

Dataset Creation

Curation Rationale

We created this corpus to contribute to the development of language models in Catalan, a low-resource language. There exist few resources for summarization in Catalan.

Source Data

Initial Data Collection and Normalization

We obtained the headline and corresponding body of each news piece from the Catalan News Agency (Agència Catalana de Notícies; ACN) website and applied the following cleaning pipeline: deduplicating the documents, removing documents with empty attributes, and deleting some boilerplate sentences.
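The three cleaning steps can be sketched as below. This is an illustration under assumptions, not the curators' actual code: the function name `clean_corpus`, exact-match deduplication, and the example boilerplate sentence are all hypothetical, since the card does not list the sentences that were removed or how duplicates were detected.

```python
def clean_corpus(records, boilerplate_sentences=()):
    """Deduplicate, drop records with empty fields, then strip boilerplate sentences."""
    # Step 1: deduplicate on an exact (headline, body) match, keeping the first occurrence.
    seen, unique = set(), []
    for record in records:
        key = (record.get("summary"), record.get("text"))
        if key not in seen:
            seen.add(key)
            unique.append(record)

    # Step 2: remove records whose headline or body is missing or blank.
    non_empty = [
        r for r in unique
        if (r.get("summary") or "").strip() and (r.get("text") or "").strip()
    ]

    # Step 3: delete known boilerplate sentences and normalize whitespace in the body.
    cleaned = []
    for record in non_empty:
        text = record["text"]
        for sentence in boilerplate_sentences:
            text = text.replace(sentence, "")
        cleaned.append({"summary": record["summary"].strip(), "text": " ".join(text.split())})
    return cleaned


# Toy input; "Llegiu-ne més." is a made-up boilerplate sentence.
records = [
    {"summary": "A", "text": "Body one. Llegiu-ne més."},
    {"summary": "A", "text": "Body one. Llegiu-ne més."},
    {"summary": "", "text": "Body two."},
    {"summary": "B", "text": "Body three."},
]
cleaned = clean_corpus(records, boilerplate_sentences=["Llegiu-ne més."])
print(cleaned)
```

The steps run in the order the card lists them. Running deduplication after boilerplate removal instead would also catch records that differ only in their boilerplate.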

Who are the source language producers?

The news portal Catalan News Agency (Agència Catalana de Notícies; ACN).

Annotations

The dataset is unannotated.

Annotation process

[N/A]

Who are the annotators?

[N/A]

Personal and Sensitive Information

Since all data comes from public websites, no anonymization process was performed.

Considerations for Using the Data

Social Impact of Dataset

We hope this corpus contributes to the development of summarization models in Catalan, a low-resource language.

Discussion of Biases

We are aware that since the data comes from unreliable web pages, some biases may be present in the dataset. Nonetheless, we have not applied any steps to reduce their impact.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

This work was funded by the MT4All CEF project and the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Licensing information

Creative Commons Attribution 4.0 International.

BibTeX citation

If you use any of these resources (datasets or models) in your work, please cite our latest preprint:

@misc{degibert2022sequencetosequence,
      title={Sequence-to-Sequence Resources for Catalan}, 
      author={Ona de Gibert and Ksenia Kharitonova and Blanca Calvo Figueras and Jordi Armengol-Estapé and Maite Melero},
      year={2022},
      eprint={2202.06871},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

[N/A]