Dataset:
tner/bc5cdr
BioCreative V CDR NER dataset, formatted as part of the TNER project. The original dataset consists of documents that are too long to feed to a language model, so we split them into sentences to reduce their size.
An example from the train split looks as follows.
{ 'tags': [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], 'tokens': ['Fasciculations', 'in', 'six', 'areas', 'of', 'the', 'body', 'were', 'scored', 'from', '0', 'to', '3', 'and', 'summated', 'as', 'a', 'total', 'fasciculation', 'score', '.'] }
The label2id dictionary can be found here.
{ "O": 0, "B-Chemical": 1, "B-Disease": 2, "I-Disease": 3, "I-Chemical": 4 }
| name | train | validation | test |
|---|---|---|---|
| bc5cdr | 5228 | 5330 | 5865 |
@article{wei2016assessing,
  title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
  author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
  journal={Database},
  volume={2016},
  year={2016},
  publisher={Oxford Academic}
}