Dataset: projecte-aina/vilasum
Tasks: summarization
Languages: ca
Multilinguality: monolingual
Language creators: expert-generated
Annotations creators: machine-generated
Preprint: arxiv:2202.06871
License: cc-by-nc-4.0

VilaSum is a summarization dataset for evaluation. It is extracted from a newswire corpus crawled from the Catalan news portal VilaWeb. The corpus consists of 13,843 instances, each composed of a headline and a body.
The dataset can be used to train a model for abstractive summarization. Success on this task is typically measured by a high ROUGE score. The mbart-base-ca-casum model currently achieves a score of 35.04.
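To make the metric concrete, here is a minimal sketch of a ROUGE-1 style F1 score (unigram overlap) in plain Python. It is a simplification under stated assumptions: lowercasing and whitespace tokenization are choices made here, the function name `rouge_1_f1` and the candidate sentence are hypothetical, and the source does not say which ROUGE variant or scoring tool produced the reported 35.04.

```python
from collections import Counter


def rouge_1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: lowercased, whitespace-split tokens. Standard ROUGE
    tooling may tokenize and normalize differently.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Reference: a shortened headline; candidate: a made-up model output.
reference = "Un vídeo corrobora les agressions a dues animalistes"
candidate = "Un vídeo certifica les agressions a dues activistes"
score = rouge_1_f1(reference, candidate)
```

Scores from a sketch like this are not comparable with published numbers; for that, the same scoring tool and settings as the original evaluation would be needed.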
The dataset is in Catalan (ca-CA).
{
  "summary": "Un vídeo corrobora les agressions a dues animalistes en un correbou del Mas de Barberans",
  "text": "Noves imatges, a les quals ha tingut accés l'ACN, certifiquen les agressions i la destrucció del material d'enregistrament que han denunciat dues activistes d'AnimaNaturalis en la celebració d'un acte de bous a la plaça al Mas de Barberans (Montsià). En el vídeo es veu com unes quantes persones s'abalancen sobre les noies que reben estirades i cops mentre els intenten prendre les càmeres. Membres de la comissió taurina intervenen per aturar els presumptes agressors però es pot escoltar com part del públic victoreja la situació. Els Mossos d'Esquadra presentaran aquest dilluns al migdia l'atestat dels fets al Jutjat d'Amposta. Dissabte ja es van detenir quatre persones que van quedar en llibertat a l'espera de ser cridats pel jutge. Es tracta de tres homes i una dona de Sant Carles de la Ràpita, tots ells membres de la mateixa família."
}
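The instance above has two string fields, `summary` and `text`. As a rough sketch of how such a record might be represented and validated in Python, the class below mirrors those two fields. The class name `VilaSumInstance` and its `from_dict` helper are hypothetical and not part of any published tooling; reading `summary` as the headline and `text` as the body is an inference from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VilaSumInstance:
    """Hypothetical container for one record; field names follow the example."""

    summary: str  # presumably the headline of the news piece
    text: str     # presumably the body of the news piece

    @classmethod
    def from_dict(cls, record: dict) -> "VilaSumInstance":
        # Reject records that lack either of the two expected fields.
        missing = {"summary", "text"} - set(record)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return cls(summary=record["summary"], text=record["text"])


# Shortened, illustrative record with the same shape as the instance above.
instance = VilaSumInstance.from_dict(
    {"summary": "Un vídeo corrobora les agressions", "text": "Noves imatges..."}
)
```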
Due to the reduced size of the dataset, we use it only for evaluation as a test set.
We created this corpus to contribute to the development of language models in Catalan, a low-resource language. There exist few resources for summarization in Catalan.
For each news piece on VilaWeb, we obtained the headline and its corresponding body, then applied the following cleaning pipeline: deduplicating the documents, removing documents with empty attributes, and deleting some boilerplate sentences.
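The three cleaning steps could be sketched roughly as follows. This is an illustration under assumptions, not the authors' actual pipeline: the function names, exact-match deduplication, the `summary`/`text` field names for the raw documents, and the boilerplate sentence in the example are all hypothetical, since the source does not specify how each step was implemented.

```python
def deduplicate(documents):
    """Keep the first occurrence of each (headline, body) pair (exact match)."""
    seen = set()
    unique = []
    for doc in documents:
        key = (doc.get("summary", ""), doc.get("text", ""))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def drop_empty(documents):
    """Remove documents whose headline or body is missing or blank."""
    return [
        doc for doc in documents
        if doc.get("summary", "").strip() and doc.get("text", "").strip()
    ]


def strip_boilerplate(documents, boilerplate_sentences):
    """Delete each listed boilerplate sentence from the body text."""
    cleaned = []
    for doc in documents:
        body = doc["text"]
        for sentence in boilerplate_sentences:
            body = body.replace(sentence, "")
        # Collapse the whitespace left behind by removed sentences.
        cleaned.append({"summary": doc["summary"], "text": " ".join(body.split())})
    return cleaned


def clean_corpus(documents, boilerplate_sentences):
    """Apply the steps in the order listed in the text."""
    return strip_boilerplate(drop_empty(deduplicate(documents)), boilerplate_sentences)


# Made-up documents and a made-up boilerplate sentence, for illustration only.
docs = [
    {"summary": "Titular A", "text": "Cos de la notícia. Segueix-nos a les xarxes."},
    {"summary": "Titular A", "text": "Cos de la notícia. Segueix-nos a les xarxes."},
    {"summary": "", "text": "Cos sense titular."},
]
cleaned = clean_corpus(docs, ["Segueix-nos a les xarxes."])
```

In this toy run the duplicate and the headline-less document are dropped, and the remaining body loses its trailing boilerplate sentence.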
Who are the source language producers?
The news portal VilaWeb.
The dataset is unannotated.
Annotation process
[N/A]
Who are the annotators?
[N/A]
Since all data comes from public websites, no anonymization process was performed.
We hope this corpus contributes to the development of summarization models in Catalan, a low-resource language.
We are aware that since the data comes from unreliable web pages, some biases may be present in the dataset. Nonetheless, we have not applied any steps to reduce their impact.
[N/A]
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the MT4All CEF project and the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
Creative Commons Attribution 4.0 International.
If you use any of these resources (datasets or models) in your work, please cite our latest preprint:
@misc{degibert2022sequencetosequence,
  title={Sequence-to-Sequence Resources for Catalan},
  author={Ona de Gibert and Ksenia Kharitonova and Blanca Calvo Figueras and Jordi Armengol-Estapé and Maite Melero},
  year={2022},
  eprint={2202.06871},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
[N/A]