Dataset:
LeoCordoba/CC-NEWS-ES-titles
CC-NEWS-ES-titles is a Spanish-language dataset for news title generation. The text and titles come from 2019 and 2020 CC-NEWS data (which is part of Common Crawl).
It contains 402.310 pairs of news titles and bodies, split into:
Train: 370.125
Eval: 16.092
Test: 16.092
The text is in Spanish. The BCP-47 code for Spanish is es.
Each data instance contains two features: text (the news body) and output_text (the title).
An example from the CC-NEWS-ES-titles train set looks like the following:
{'text': 'Hoy en el Boletín Oficial también se publicó la disposición para universidades, institutos universitarios y de educación superior de todas las jurisdicciones, a las que recomienda que "adecúen las condiciones en que se desarrolla la actividad académica presencial en el marco de la emergencia conforme con las recomendaciones del Ministerio de Salud", según lo publicado por la agencia ', 'output_text': 'Coronavirus: "Seguimos educando", la plataforma online para que los chicos estudien en cuarentena'}
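As a minimal sketch of how such an instance might be consumed, the helper below pulls the body and title out of one record. The function names `split_instance` and `load_train_split` are hypothetical, the sample record is a shortened stand-in rather than real data, and the loader assumes the dataset is published on the Hugging Face Hub under the identifier shown above with a split named train.

```python
from typing import Dict, Tuple


def split_instance(instance: Dict[str, str]) -> Tuple[str, str]:
    """Return (body, title) from one instance with 'text' and 'output_text'."""
    missing = {"text", "output_text"} - set(instance)
    if missing:
        raise KeyError(f"instance is missing fields: {sorted(missing)}")
    # Strip stray whitespace, such as the trailing space in the example above.
    return instance["text"].strip(), instance["output_text"].strip()


def load_train_split():
    """Load the train split from the Hub (assumed setup, needs network access)."""
    # Imported lazily so split_instance works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("LeoCordoba/CC-NEWS-ES-titles", split="train")


# Hypothetical shortened instance, for illustration only.
sample = {"text": "Cuerpo de la noticia. ", "output_text": "Título de la noticia"}
body, title = split_instance(sample)
print(title)
```

In a title-generation setup, `body` would typically serve as the model input and `title` as the target.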
The CC-NEWS-ES-titles dataset has 3 splits: train, validation, and test. The splits contain disjoint sets of news.
Dataset Split | Number of Instances in Split |
---|---|
Train | 370.125 |
Eval | 16.092 |
Test | 16.092 |
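To make the relative sizes of the splits concrete, the short sketch below computes each split's share of the data. It assumes the periods in the table are thousands separators, so that 370.125 means 370125 instances; the variable names are illustrative.

```python
# Split sizes as listed in the table, reading "." as a thousands separator.
split_sizes = {"train": 370_125, "eval": 16_092, "test": 16_092}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

Under that reading, the three listed counts add up to 402309 and correspond to roughly a 92% / 4% / 4% partition.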
Who are the source language producers?
Common Crawl: https://commoncrawl.org/
The dataset does not contain any additional annotations.
Annotation process
[N/A]
Who are the annotators?
[N/A]
Abstractive summarization is a complex task, and Spanish is an underrepresented language in the NLP domain. Adding a Spanish resource may therefore help others with their research and educational activities.
This dataset is maintained by Leonardo Ignacio Córdoba and was built with the help of María Gaska.