Dataset:

tner/bc5cdr

Language:

en

Multilinguality:

monolingual

Size:

10K<n<100K

License:

other

Dataset Card for "tner/bc5cdr"

Dataset Summary

The BioCreative V CDR NER dataset, formatted as part of the TNER project. The original dataset consists of long documents that are too long to be fed to a language model, so we split them into sentences to reduce their size.

  • Entity Types: Chemical, Disease

Dataset Structure

Data Instances

An example from the train split looks as follows.

{
    'tags': [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    'tokens': ['Fasciculations', 'in', 'six', 'areas', 'of', 'the', 'body', 'were', 'scored', 'from', '0', 'to', '3', 'and', 'summated', 'as', 'a', 'total', 'fasciculation', 'score', '.']
}

Label ID

The label2id dictionary maps each label to an integer ID:

{
    "O": 0,
    "B-Chemical": 1,
    "B-Disease": 2,
    "I-Disease": 3,
    "I-Chemical": 4
}
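To make the tag IDs easier to read, the following minimal sketch inverts the label2id dictionary above and groups BIO-tagged tokens into entity spans. The names `LABEL2ID`, `ID2LABEL`, and `extract_entities` are hypothetical helpers introduced for illustration and are not part of the dataset or of TNER; joining tokens with a single space is also an assumption.

```python
# Label mapping copied from the card; the helper names are illustrative only.
LABEL2ID = {"O": 0, "B-Chemical": 1, "B-Disease": 2, "I-Disease": 3, "I-Chemical": 4}
ID2LABEL = {label_id: label for label, label_id in LABEL2ID.items()}


def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = ID2LABEL[tag].partition("-")
        continues = prefix == "I" and entity_type == current_type
        if continues:
            # Same entity type as the open span: extend it.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if prefix in ("B", "I"):
            # "B-" starts a span; a stray "I-" is treated as a start too.
            current_tokens, current_type = [token], entity_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# The train example shown in "Data Instances".
example = {
    "tags": [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    "tokens": ["Fasciculations", "in", "six", "areas", "of", "the", "body",
               "were", "scored", "from", "0", "to", "3", "and", "summated",
               "as", "a", "total", "fasciculation", "score", "."],
}
print(extract_entities(example["tokens"], example["tags"]))
# [('Fasciculations', 'Disease'), ('fasciculation', 'Disease')]
```

In this example both tagged tokens carry ID 2 (B-Disease), so each becomes a single-token Disease entity.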

Data Splits

name     train   validation   test
bc5cdr   5228    5330         5865
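As a rough sanity check, the split sizes in the table can be compared against a loaded copy of the dataset. The sketch below assumes the Hugging Face `datasets` library is installed and that the installed version is able to load `tner/bc5cdr`; `SPLIT_SIZES`, `check_split_sizes`, and `load_bc5cdr` are hypothetical names introduced here.

```python
# Split sizes as listed in the card's table.
SPLIT_SIZES = {"train": 5228, "validation": 5330, "test": 5865}


def check_split_sizes(dataset, expected=SPLIT_SIZES):
    """Return {split: (actual, expected)} for every split whose size differs."""
    mismatches = {}
    for split, expected_size in expected.items():
        actual_size = len(dataset[split])
        if actual_size != expected_size:
            mismatches[split] = (actual_size, expected_size)
    return mismatches


def load_bc5cdr():
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the checker above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("tner/bc5cdr")
```

Calling `check_split_sizes(load_bc5cdr())` returns an empty dictionary when every split has the size listed in the table. The three splits sum to 16,423 examples, which is consistent with the 10K<n<100K size category.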

Citation Information

@article{wei2016assessing,
  title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
  author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
  journal={Database},
  volume={2016},
  year={2016},
  publisher={Oxford Academic}
}