The Hallmarks of Cancer Corpus for text classification
The Hallmarks of Cancer (HoC) Corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to a taxonomy of 37 classes arranged in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, and the tokenized text under the "text" directory. The filenames are the corresponding PubMed IDs (PMIDs).
In addition to the HoC corpus, we also provide the Cancer Hallmarks Analytics Tool (CHAT), which classifies all of PubMed according to the HoC taxonomy.
The dataset can be used to train a model for multi-label classification, since each sentence can carry zero or more labels.
The corpus consists only of English-language PubMed articles.

    from datasets import load_dataset

    dataset = load_dataset("qanastek/HoC")
    validation = dataset["validation"]
    print("First element of the validation set : ", validation[0])


    {
        "document_id": "12634122_5",
        "text": "Genes that were overexpressed in OM3 included oncogenes , cell cycle regulators , and those involved in signal transduction , whereas genes for DNA repair enzymes and inhibitors of transformation and metastasis were suppressed .",
        "label": [9, 5, 0, 6]
    }

document_id : Unique identifier of the document.
text : Tokenized text of a sentence from a PubMed abstract.
label : List of class indices, each referring to one of the 10 currently known hallmarks of cancer.
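Because `label` holds a list of class indices, multi-label training code usually needs it as a fixed-length 0/1 vector. The following is a minimal sketch of that conversion, applied to the sample record shown above. The helper name `to_multi_hot` and the constant `NUM_CLASSES = 10` are assumptions made for illustration; the actual number of classes should be checked against the loaded dataset's features before relying on it.

```python
from typing import List

# Assumption: one index per hallmark (0-9), matching the sample record above.
NUM_CLASSES = 10


def to_multi_hot(labels: List[int], num_classes: int = NUM_CLASSES) -> List[int]:
    """Convert a list of class indices into a fixed-length 0/1 vector."""
    vector = [0] * num_classes
    for index in labels:
        if not 0 <= index < num_classes:
            raise ValueError(f"label index {index} is outside 0..{num_classes - 1}")
        vector[index] = 1
    return vector


# The label list of the sample record shown above.
sample = {"document_id": "12634122_5", "label": [9, 5, 0, 6]}
multi_hot = to_multi_hot(sample["label"])
print(multi_hot)  # [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]
```

A sentence without any label maps to an all-zero vector, which keeps unlabeled sentences usable as negative examples.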
Hallmark | Search terms |
---|---|
1. Sustaining proliferative signaling (PS) | Proliferation Receptor Cancer; 'Growth factor' Cancer; 'Cell cycle' Cancer |
2. Evading growth suppressors (GS) | 'Cell cycle' Cancer; 'Contact inhibition' |
3. Resisting cell death (CD) | Apoptosis Cancer; Necrosis Cancer; Autophagy Cancer |
4. Enabling replicative immortality (RI) | Senescence Cancer; Immortalization Cancer |
5. Inducing angiogenesis (A) | Angiogenesis Cancer; 'Angiogenic factor' |
6. Activating invasion & metastasis (IM) | Metastasis Invasion Cancer |
7. Genome instability & mutation (GI) | Mutation Cancer; 'DNA repair' Cancer; Adducts Cancer; 'Strand breaks' Cancer; 'DNA damage' Cancer |
8. Tumor-promoting inflammation (TPI) | Inflammation Cancer; 'Oxidative stress' Cancer; Inflammation 'Immune response' Cancer |
9. Deregulating cellular energetics (CE) | Glycolysis Cancer; 'Warburg effect' Cancer |
10. Avoiding immune destruction (ID) | 'Immune system' Cancer; Immunosuppression Cancer |
Distribution of data for the 10 hallmarks:
Hallmark | No. abstracts | No. sentences |
---|---|---|
1. PS | 462 | 993 |
2. GS | 242 | 468 |
3. CD | 430 | 883 |
4. RI | 115 | 295 |
5. A | 143 | 357 |
6. IM | 291 | 667 |
7. GI | 333 | 771 |
8. TPI | 194 | 437 |
9. CE | 105 | 213 |
10. ID | 108 | 226 |
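The sentence counts above are uneven: the most frequent hallmark (PS) has more than four times as many annotated sentences as the rarest (CE). One common way to compensate during training is inverse-frequency class weighting. The sketch below computes such weights from the table. It is an illustrative heuristic, not a method described by the corpus authors, and the function name `inverse_frequency_weights` is hypothetical.

```python
from typing import Dict

# Sentence counts per hallmark, copied from the distribution table above.
SENTENCE_COUNTS: Dict[str, int] = {
    "PS": 993, "GS": 468, "CD": 883, "RI": 295, "A": 357,
    "IM": 667, "GI": 771, "TPI": 437, "CE": 213, "ID": 226,
}


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by (largest count / class count).

    The most frequent class gets weight 1.0 and rarer classes get larger weights.
    """
    largest = max(counts.values())
    return {name: largest / count for name, count in counts.items()}


weights = inverse_frequency_weights(SENTENCE_COUNTS)
for name, weight in weights.items():
    print(f"{name:>3}: {weight:.2f}")
```

The counts sum to 5310 label assignments. Since a sentence can carry several labels, this is not the number of distinct labeled sentences.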
The corpus was produced and uploaded by Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen.
The corpus is free of personal or sensitive information.
HoC : Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen
Hugging Face : Labrak Yanis (Not affiliated with the original corpus)
GNU General Public License v3.0
Permissions
- Commercial use
- Modification
- Distribution
- Patent use
- Private use

Limitations
- Liability
- Warranty

Conditions
- License and copyright notice
- State changes
- Disclose source
- Same license
We would very much appreciate it if you cite our publications:
Automatic semantic classification of scientific literature according to the hallmarks of cancer

    @article{baker2015automatic,
      title={Automatic semantic classification of scientific literature according to the hallmarks of cancer},
      author={Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
      journal={Bioinformatics},
      volume={32},
      number={3},
      pages={432--440},
      year={2015},
      publisher={Oxford University Press}
    }

Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer

    @article{baker2017cancer,
      title={Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer},
      author={Baker, Simon and Ali, Imran and Silins, Ilona and Pyysalo, Sampo and Guo, Yufan and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
      journal={Bioinformatics},
      volume={33},
      number={24},
      pages={3973--3981},
      year={2017},
      publisher={Oxford University Press}
    }

Cancer hallmark text classification using convolutional neural networks

    @article{baker2016cancer,
      title={Cancer hallmark text classification using convolutional neural networks},
      author={Baker, Simon and Korhonen, Anna-Leena and Pyysalo, Sampo},
      year={2016}
    }

Initializing neural networks for hierarchical multi-label text classification

    @article{baker2017initializing,
      title={Initializing neural networks for hierarchical multi-label text classification},
      author={Baker, Simon and Korhonen, Anna},
      journal={BioNLP 2017},
      pages={307--315},
      year={2017}
    }