Dataset:

qanastek/HoC

Language:

en

Size:

1K<n<10K

Language creators:

found

Source datasets:

original

HoC : Hallmarks of Cancer Corpus

Dataset Summary

The Hallmarks of Cancer Corpus for text classification

The Hallmarks of Cancer (HoC) Corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, while the tokenized text can be found under the "text" directory. The filenames are the corresponding PubMed IDs (PMID).

In addition to the HoC corpus, we also have the Cancer Hallmarks Analytics Tool, which classifies all of PubMed according to the HoC taxonomy.

Supported Tasks and Leaderboards

The dataset can be used to train a model for multi-label text classification: each sentence can carry zero or more hallmark labels.

Languages

The corpus consists only of PubMed articles in English:

  • English - United States (en-US)

Load the dataset with HuggingFace

from datasets import load_dataset
dataset = load_dataset("qanastek/HoC")
validation = dataset["validation"]
print("First element of the validation set : ", validation[0])

Dataset Structure

Data Instances

{
  "document_id": "12634122_5",
  "text": "Genes that were overexpressed in OM3 included oncogenes , cell cycle regulators , and those involved in signal transduction , whereas genes for DNA repair enzymes and inhibitors of transformation and metastasis were suppressed .",
  "label": [9, 5, 0, 6]
}
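
The sample identifier looks like a PubMed ID followed by a sentence index, joined by an underscore. That reading is an assumption drawn from the sample above and from the note that files are named by PMID; the card does not state it explicitly. Under that assumption, a minimal sketch for grouping sentence-level records back into abstracts could look like this (the helper name `split_document_id` is hypothetical, and the sample text is shortened):

```python
from collections import defaultdict


def split_document_id(document_id: str) -> tuple[str, int]:
    """Split an id such as "12634122_5" into (PMID, sentence index).

    Assumes the "<PMID>_<sentence index>" layout seen in the sample record.
    """
    pmid, _, index = document_id.rpartition("_")
    return pmid, int(index)


# The record shown above, with its text shortened for brevity.
records = [
    {
        "document_id": "12634122_5",
        "text": "Genes that were overexpressed in OM3 included oncogenes ...",
        "label": [9, 5, 0, 6],
    },
]

# Group sentence-level records by the abstract they appear to come from.
by_abstract = defaultdict(list)
for record in records:
    pmid, sentence_index = split_document_id(record["document_id"])
    by_abstract[pmid].append((sentence_index, record["label"]))

print(dict(by_abstract))
```

Grouping by abstract is mainly useful when train and evaluation sets should not share sentences from the same publication.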

Data Fields

document_id : Unique identifier of the document.

text : Tokenized text of one sentence from a PubMed abstract.

label : List of class ids assigned to the sentence, each denoting one of the 10 currently known hallmarks of cancer.

Hallmark                                    Search term
1. Sustaining proliferative signaling (PS)  Proliferation Receptor Cancer; 'Growth factor' Cancer; 'Cell cycle' Cancer
2. Evading growth suppressors (GS)          'Cell cycle' Cancer; 'Contact inhibition'
3. Resisting cell death (CD)                Apoptosis Cancer; Necrosis Cancer; Autophagy Cancer
4. Enabling replicative immortality (RI)    Senescence Cancer; Immortalization Cancer
5. Inducing angiogenesis (A)                Angiogenesis Cancer; 'Angiogenic factor'
6. Activating invasion & metastasis (IM)    Metastasis Invasion Cancer
7. Genome instability & mutation (GI)       Mutation Cancer; 'DNA repair' Cancer; Adducts Cancer; 'Strand breaks' Cancer; 'DNA damage' Cancer
8. Tumor-promoting inflammation (TPI)       Inflammation Cancer; 'Oxidative stress' Cancer; Inflammation 'Immune response' Cancer
9. Deregulating cellular energetics (CE)    Glycolysis Cancer; 'Warburg effect' Cancer
10. Avoiding immune destruction (ID)        'Immune system' Cancer; Immunosuppression Cancer
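
Because the label field holds a list of class ids rather than a single id, most classifiers need it converted to a fixed-length 0/1 vector first. The sketch below shows one way to do that for the sample record. `NUM_CLASSES = 10` is an assumption (one id per hallmark), and the mapping from integer ids to hallmark names is not given in this card, so both should be checked against the loaded dataset before use. The helper name `to_multi_hot` is hypothetical.

```python
def to_multi_hot(label_ids: list[int], num_classes: int) -> list[int]:
    """Turn a list of class ids into a fixed-length 0/1 vector."""
    vector = [0] * num_classes
    for class_id in label_ids:
        if not 0 <= class_id < num_classes:
            raise ValueError(
                f"class id {class_id} is outside 0..{num_classes - 1}"
            )
        vector[class_id] = 1
    return vector


NUM_CLASSES = 10  # assumption: one id per hallmark; verify against the dataset

example_labels = [9, 5, 0, 6]  # "label" field of the sample record
print(to_multi_hot(example_labels, NUM_CLASSES))
# → [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]
```

The range check is deliberate: if the dataset defines more classes than assumed here, the function fails loudly instead of silently dropping a label.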

Data Splits

Distribution of data for the 10 hallmarks:

Hallmark  No. abstracts  No. sentences
1. PS     462            993
2. GS     242            468
3. CD     430            883
4. RI     115            295
5. A      143            357
6. IM     291            667
7. GI     333            771
8. TPI    194            437
9. CE     105            213
10. ID    108            226
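
The sentence counts above are uneven: the most frequent hallmark (PS) has more than four times as many sentences as the rarest (CE). One common way to account for this during training is to weight each class by how rare it is. The sketch below derives such weights from the table; the weighting scheme is an illustrative choice, not something prescribed by the corpus authors.

```python
# Sentence counts per hallmark, copied from the table above.
sentence_counts = {
    "PS": 993, "GS": 468, "CD": 883, "RI": 295, "A": 357,
    "IM": 667, "GI": 771, "TPI": 437, "CE": 213, "ID": 226,
}

# Weight each class relative to the most frequent one, so the largest
# class gets weight 1.0 and rarer classes get proportionally more.
largest = max(sentence_counts.values())
class_weights = {
    name: largest / count for name, count in sentence_counts.items()
}

for name, weight in class_weights.items():
    print(f"{name:>3}: {weight:.2f}")
```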

Dataset Creation

Source Data

Who are the source language producers?

The corpus was produced and uploaded by Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen.

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Additional Information

Dataset Curators

HoC : Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen

Hugging Face : Yanis Labrak (not affiliated with the original corpus)

Licensing Information

GNU General Public License v3.0
Permissions
- Commercial use
- Modification
- Distribution
- Patent use
- Private use
Limitations
- Liability
- Warranty
Conditions
- License and copyright notice
- State changes
- Disclose source
- Same license

Citation Information

We would very much appreciate it if you cite our publications:

Automatic semantic classification of scientific literature according to the hallmarks of cancer

@article{baker2015automatic,
  title={Automatic semantic classification of scientific literature according to the hallmarks of cancer},
  author={Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={32},
  number={3},
  pages={432--440},
  year={2015},
  publisher={Oxford University Press}
}

Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer

@article{baker2017cancer,
  title={Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer},
  author={Baker, Simon and Ali, Imran and Silins, Ilona and Pyysalo, Sampo and Guo, Yufan and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={33},
  number={24},
  pages={3973--3981},
  year={2017},
  publisher={Oxford University Press}
}

Cancer hallmark text classification using convolutional neural networks

@article{baker2016cancer,
  title={Cancer hallmark text classification using convolutional neural networks},
  author={Baker, Simon and Korhonen, Anna-Leena and Pyysalo, Sampo},
  year={2016}
}

Initializing neural networks for hierarchical multi-label text classification

@article{baker2017initializing,
  title={Initializing neural networks for hierarchical multi-label text classification},
  author={Baker, Simon and Korhonen, Anna},
  journal={BioNLP 2017},
  pages={307--315},
  year={2017}
}