Dataset:
PlanTL-GOB-ES/WikiCAT_en
carlos.rodriguez1@bsc.es
Repository
https://github.com/TeMU-BSC/WikiCAT
WikiCAT_en is an English corpus for thematic text classification tasks. It is created automatically from Wikipedia and Wikidata sources, and contains 28921 article summaries from Wikipedia classified under 19 different categories.
This dataset was developed by BSC TeMU as part of the PlanTL project, and is intended to evaluate the capability of language technologies (LT) to generate useful synthetic corpora.
Text classification, Language Model
EN - English
Two JSON files, one for each split.
We used a simple data model consisting of the article text and its associated label, without further metadata.
Example:
{"version": "1.1.0", "data": [ {"sentence": "The IEEE Donald G. Fink Prize Paper Award was established in 1979 by the board of directors of the Institute of Electrical and Electronics Engineers (IEEE) in honor of Donald G. Fink. He was a past president of the Institute of Radio Engineers (IRE), and the first general manager and executive director of the IEEE. Recipients of this award received a certificate and an honorarium. The award was presented annually since 1981 and discontinued in 2016.", "label": "Engineering"}, . . . ] }
Labels
'Health', 'Law', 'Entertainment', 'Religion', 'Business', 'Science', 'Engineering', 'Nature', 'Philosophy', 'Economy', 'Sports', 'Technology', 'Government', 'Mathematics', 'Military', 'Humanities', 'Music', 'Politics', 'History'
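As a minimal sketch of how a split in the format shown above might be read, the following parses the JSON payload into (sentence, label) pairs, counts the labels, and builds a label-to-id mapping. The helper names (`parse_split`, `label_distribution`) and the three-record sample are hypothetical and only illustrate the structure; the real files would be read from disk instead.

```python
import json
from collections import Counter

# The 19 labels listed in this card.
LABELS = [
    "Health", "Law", "Entertainment", "Religion", "Business", "Science",
    "Engineering", "Nature", "Philosophy", "Economy", "Sports", "Technology",
    "Government", "Mathematics", "Military", "Humanities", "Music",
    "Politics", "History",
]


def parse_split(raw_json):
    """Return a list of (sentence, label) pairs from a WikiCAT-style JSON string."""
    payload = json.loads(raw_json)
    return [(item["sentence"], item["label"]) for item in payload["data"]]


def label_distribution(pairs):
    """Count how many examples carry each label."""
    return Counter(label for _, label in pairs)


# Hypothetical toy payload with placeholder sentences (not real corpus data).
sample = json.dumps({
    "version": "1.1.0",
    "data": [
        {"sentence": "placeholder summary A", "label": "Engineering"},
        {"sentence": "placeholder summary B", "label": "Music"},
        {"sentence": "placeholder summary C", "label": "Engineering"},
    ],
})

pairs = parse_split(sample)
counts = label_distribution(pairs)

# Integer ids for training a classifier; the ordering is an arbitrary choice.
label2id = {label: idx for idx, label in enumerate(LABELS)}

# Labels that do not belong to the documented set (expected to be empty).
unknown = {label for _, label in pairs} - set(LABELS)

print(counts)
print(unknown)
```

For an actual split, the string passed to `parse_split` would come from reading one of the two JSON files; their file names are not stated in this card.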
Starting “Category:” pages are chosen to represent the topics in each language.
For each category, the main pages are extracted, along with the subcategories and the individual pages under these first-level subcategories. For each page, the “summary” provided by Wikipedia is also extracted.
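The collection procedure described above can be sketched roughly as follows. This is not the code used to build the corpus: `collect_examples`, `list_members` and `get_summary` are hypothetical names, the Wikipedia access is replaced by an in-memory toy graph, and the de-duplication and the skipping of pages without a summary are assumptions added for the illustration.

```python
def collect_examples(start_categories, list_members, get_summary):
    """Build {"sentence", "label"} records from starting category pages.

    start_categories: dict mapping a label to its starting category title.
    list_members:     callable(category) -> (pages, subcategories).
    get_summary:      callable(page_title) -> summary text or None.
    """
    examples = []
    for label, category in start_categories.items():
        # Main pages of the starting category and its subcategories.
        pages, subcategories = list_members(category)
        titles = list(pages)
        # Pages directly under each first-level subcategory; deeper levels
        # are not followed.
        for subcategory in subcategories:
            sub_pages, _deeper = list_members(subcategory)
            titles.extend(sub_pages)
        seen = set()
        for title in titles:
            if title in seen:  # assumption: keep each page once per label
                continue
            seen.add(title)
            summary = get_summary(title)
            if summary:  # assumption: skip pages without a summary
                examples.append({"sentence": summary, "label": label})
    return examples


# Toy stand-in for the category graph (hypothetical titles and summaries).
TOY_CATEGORIES = {
    "Category:Engineering": (["Engineering"], ["Category:Engineering awards"]),
    "Category:Engineering awards": (
        ["IEEE Donald G. Fink Prize Paper Award"],
        ["Category:Second level"],
    ),
    "Category:Second level": (["Page that is never reached"], []),
}
TOY_SUMMARIES = {
    "Engineering": "toy summary of the Engineering page",
    "IEEE Donald G. Fink Prize Paper Award": "toy summary of the award page",
}

examples = collect_examples(
    {"Engineering": "Category:Engineering"},
    lambda category: TOY_CATEGORIES[category],
    TOY_SUMMARIES.get,
)
print(examples)
```

Passing the two lookups in as callables keeps the traversal logic separate from whichever Wikipedia client is used to list category members and fetch summaries.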
The source data are Wikipedia page summaries and thematic categories.
Who are the source language producers?
Automatic annotation
No personal or sensitive information included.
[N/A]
[N/A]
[N/A]
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es).
For further information, send an email to plantl-gob-es@bsc.es.
This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
[N/A]