Dataset Card for "tner/bc5cdr"

Dataset Summary

The BioCreative V CDR NER dataset, formatted as part of the TNER project. The original dataset consists of documents that are too long to feed to a language model, so we split them into sentences to reduce their size.

Entity Types: Chemical, Disease

Dataset Structure

Data Instances

An example from the train split looks as follows.

{
    'tags': [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    'tokens': ['Fasciculations', 'in', 'six', 'areas', 'of', 'the', 'body', 'were', 'scored', 'from', '0', 'to', '3', 'and', 'summated', 'as', 'a', 'total', 'fasciculation', 'score', '.']
}

Label ID

The label2id dictionary is as follows.

{
    "O": 0,
    "B-Chemical": 1,
    "B-Disease": 2,
    "I-Disease": 3,
    "I-Chemical": 4
}
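
As a minimal sketch of how the integer tags relate to the labels, the snippet below inverts the label2id dictionary and groups the BIO tags of the train example above into entity spans. The helper name `extract_entities` is hypothetical and not part of TNER; it assumes the usual BIO convention in which a `B-` tag opens a span and matching `I-` tags extend it.

```python
# label2id as listed in this card; id2label is its inverse.
label2id = {"O": 0, "B-Chemical": 1, "B-Disease": 2, "I-Disease": 3, "I-Chemical": 4}
id2label = {idx: label for label, idx in label2id.items()}


def extract_entities(tokens, tags):
    """Hypothetical helper: group BIO tag ids into (text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the span collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        label = id2label[tag]
        if label.startswith("B-"):
            # A B- tag always opens a new span.
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span, closes the span.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# The train example shown above (21 tokens, 21 tags).
tokens = ['Fasciculations', 'in', 'six', 'areas', 'of', 'the', 'body', 'were',
          'scored', 'from', '0', 'to', '3', 'and', 'summated', 'as', 'a',
          'total', 'fasciculation', 'score', '.']
tags = [2] + [0] * 17 + [2, 0, 0]

print(extract_entities(tokens, tags))
# [('Fasciculations', 'Disease'), ('fasciculation', 'Disease')]
```

Both entities in this sentence are single-token Disease mentions, so no `I-` tags appear; multi-token mentions would be joined with spaces by the same helper.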

Data Splits

name	train	validation	test
bc5cdr	5228	5330	5865

Citation Information

@article{wei2016assessing,
  title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
  author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
  journal={Database},
  volume={2016},
  year={2016},
  publisher={Oxford Academic}
}

Author: tner

Dataset size: 4.15 MB