Dataset: projecte-aina/casum
Tasks: Summarization
Languages: ca
Multilinguality: monolingual
Language creators: expert-generated
Annotations creators: machine-generated
ArXiv: arxiv:2202.06871
License: cc-by-nc-4.0
CaSum is a summarization dataset. It is extracted from a newswire corpus crawled from the Catalan News Agency (Agència Catalana de Notícies; ACN). The corpus consists of 217,735 instances, each composed of a headline and a body.
The dataset can be used to train a model for abstractive summarization. Success on this task is typically measured by achieving a high ROUGE score. The mbart-base-ca-casum model currently achieves a ROUGE score of 41.39.
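To make the metric concrete, here is a minimal sketch of a unigram-overlap (ROUGE-1 style) F1 score. It is an illustration only: it assumes simple lowercase whitespace tokenization, the function name `rouge1_f1` is hypothetical, and the card does not state which ROUGE variant or scorer produced the reported 41.39, so numbers from this sketch are not comparable to that figure.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings.

    Assumption: tokens are lowercase, whitespace-separated words. Official
    ROUGE scorers apply their own tokenization and optional stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

In practice a maintained ROUGE implementation would be preferable for reporting results; the sketch only shows what the score measures, with the headline as the reference and the model output as the candidate.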
The dataset is in Catalan (ca-CA).
{ 'summary': 'Mapfre preveu ingressar 31.000 milions d’euros al tancament de 2018', 'text': 'L’asseguradora llançarà la seva filial Verti al mercat dels EUA a partir de 2017 ACN Madrid.-Mapfre preveu assolir uns ingressos de 31.000 milions d'euros al tancament de 2018 i destinarà a retribuir els seus accionistes com a mínim el 50% dels beneficis del grup durant el període 2016-2018, amb una rendibilitat mitjana a l’entorn del 5%, segons ha anunciat la companyia asseguradora durant la celebració aquest divendres de la seva junta general d’accionistes. La firma asseguradora també ha avançat que llançarà la seva filial d’automoció i llar al mercat dels EUA a partir de 2017. Mapfre ha recordat durant la junta que va pagar més de 540 milions d'euros en impostos el 2015, amb una taxa impositiva efectiva del 30,4 per cent. La companyia també ha posat en marxa el Pla de Sostenibilitat 2016-2018 i el Pla de Transparència Activa, “que han de contribuir a afermar la visió de Mapfre com a asseguradora global de confiança”, segons ha informat en un comunicat.' }
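As a rough illustration of how an instance like the one above might be consumed, the sketch below maps a record to a (source, target) pair for sequence-to-sequence training. The helper names `to_source_target` and `length_ratio` are hypothetical, and the example body is truncated to its first sentence for brevity. Loading the records themselves (for instance through the Hugging Face `datasets` library) is assumed to happen elsewhere and is not shown.

```python
from typing import Dict, Tuple


def to_source_target(instance: Dict[str, str]) -> Tuple[str, str]:
    """Return (article body, headline) from a record with 'text' and 'summary'."""
    return instance["text"].strip(), instance["summary"].strip()


def length_ratio(instance: Dict[str, str]) -> float:
    """Ratio of headline length to body length, in whitespace-separated words."""
    source, target = to_source_target(instance)
    return len(target.split()) / len(source.split())


# Abbreviated copy of the instance shown above (body cut to one sentence).
example = {
    "summary": "Mapfre preveu ingressar 31.000 milions d’euros al tancament de 2018",
    "text": "ACN Madrid.-Mapfre preveu assolir uns ingressos de 31.000 milions "
            "d'euros al tancament de 2018.",
}

source, target = to_source_target(example)
```

Here `source` would be fed to the encoder and `target` used as the reference summary.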
We split our dataset into train, dev, and test sets.
We created this corpus to contribute to the development of language models in Catalan, a low-resource language. There exist few resources for summarization in Catalan.
We obtained the headline and corresponding body of each news piece from the Catalan News Agency (Agència Catalana de Notícies; ACN) website and applied the following cleaning pipeline: deduplicating the documents, removing documents with empty attributes, and deleting some boilerplate sentences.
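The three cleaning steps can be sketched as follows. This is not the authors' code: the function name `clean_corpus`, the placeholder boilerplate sentence, the exact-match deduplication key, and the order in which the steps run are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical boilerplate sentence; the card does not list the real ones.
BOILERPLATE: Tuple[str, ...] = ("Llegiu-ne més al web.",)


def clean_corpus(
    records: Iterable[Dict[str, str]],
    boilerplate: Tuple[str, ...] = BOILERPLATE,
) -> List[Dict[str, str]]:
    """Strip boilerplate, drop records with empty fields, and deduplicate."""
    seen = set()
    cleaned = []
    for record in records:
        headline = record.get("summary", "").strip()
        body = record.get("text", "")

        # Delete known boilerplate sentences, then normalize whitespace.
        for sentence in boilerplate:
            body = body.replace(sentence, "")
        body = " ".join(body.split())

        # Remove documents with an empty headline or body.
        if not headline or not body:
            continue

        # Deduplicate on the exact (headline, body) pair (an assumption).
        key = (headline, body)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"summary": headline, "text": body})
    return cleaned
```

Removing boilerplate before deduplicating means two copies of an article that differ only in a boilerplate sentence collapse into one; the original pipeline may have made a different choice.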
Who are the source language producers?
The news portal Catalan News Agency (Agència Catalana de Notícies; ACN).
The dataset is unannotated.
Annotation process
[N/A]
Who are the annotators?
[N/A]
Since all data comes from public websites, no anonymization process was performed.
We hope this corpus contributes to the development of summarization models in Catalan, a low-resource language.
We are aware that since the data comes from unreliable web pages, some biases may be present in the dataset. Nonetheless, we have not applied any steps to reduce their impact.
[N/A]
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the MT4All CEF project and the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
Creative Commons Attribution 4.0 International.
If you use any of these resources (datasets or models) in your work, please cite our latest preprint:
@misc{degibert2022sequencetosequence,
  title={Sequence-to-Sequence Resources for Catalan},
  author={Ona de Gibert and Ksenia Kharitonova and Blanca Calvo Figueras and Jordi Armengol-Estapé and Maite Melero},
  year={2022},
  eprint={2202.06871},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
[N/A]