Dataset:
bigbio/hprd50
HPRD50 is a dataset of randomly selected, hand-annotated abstracts of biomedical papers referenced by the Human Protein Reference Database (HPRD). The source data is in XML format: each abstract is split into sentences, and each sentence may contain entities and interactions between those entities. In this dataset, all entities are proteins, so all interactions are protein-protein interactions.
Moreover, all entities are normalized to the HPRD database. The normalized terms are stored in each entity's 'type' attribute in the source XML. This makes it possible to determine, for example, that "Janus kinase 2" and "Jak2" refer to the same normalized entity.
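As a rough illustration of how the normalization can be used, the sketch below groups entity mentions by their 'type' attribute. The XML fragment is hypothetical: apart from the 'type' attribute, the element names (`corpus`, `document`, `sentence`, `entity`), the other attribute names, and the identifier value are assumptions made for this example and may not match the actual source files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: only the 'type' attribute is described in the dataset
# card. All other element/attribute names and the identifier value are
# assumptions made for illustration.
SAMPLE_XML = """
<corpus>
  <document id="d0">
    <sentence id="d0.s0" text="Janus kinase 2 binds the receptor.">
      <entity id="d0.s0.e0" text="Janus kinase 2" type="HYPOTHETICAL_ID_1"/>
    </sentence>
    <sentence id="d0.s1" text="Jak2 is then phosphorylated.">
      <entity id="d0.s1.e0" text="Jak2" type="HYPOTHETICAL_ID_1"/>
    </sentence>
  </document>
</corpus>
"""


def group_mentions_by_type(xml_text):
    """Map each normalized term ('type') to the set of surface forms seen."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(set)
    for entity in root.iter("entity"):
        groups[entity.get("type")].add(entity.get("text"))
    return dict(groups)


mentions = group_mentions_by_type(SAMPLE_XML)
print(mentions)
```

Mentions that share a key in the resulting dictionary refer to the same normalized entity, regardless of how they are written in the text.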
Because the dataset contains entities and relations, it is suitable for Named Entity Recognition and Relation Extraction.
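One possible way to turn sentence-level annotations into Relation Extraction examples is sketched below. It is not the dataset's official loader. The element and attribute names (`sentence`, `entity`, `interaction`, `id`, `text`, `e1`, `e2`) are assumptions, as are two modeling choices: interactions are treated as undirected, and every entity pair without an annotated interaction is treated as a negative example.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical sentence: element and attribute names are assumptions made
# for illustration and may differ from the real source XML.
SAMPLE_SENTENCE = """
<sentence id="s0" text="ProtA binds ProtB in the presence of ProtC.">
  <entity id="s0.e0" text="ProtA"/>
  <entity id="s0.e1" text="ProtB"/>
  <entity id="s0.e2" text="ProtC"/>
  <interaction id="s0.i0" e1="s0.e0" e2="s0.e1"/>
</sentence>
"""


def relation_candidates(sentence_xml):
    """Return (entity_a, entity_b, interacts) for every entity pair."""
    sentence = ET.fromstring(sentence_xml)
    entities = [(e.get("id"), e.get("text")) for e in sentence.iter("entity")]
    # Assumption: interactions are undirected, so store each as an unordered pair.
    positives = {
        frozenset((i.get("e1"), i.get("e2")))
        for i in sentence.iter("interaction")
    }
    candidates = []
    for (id_a, text_a), (id_b, text_b) in combinations(entities, 2):
        # Assumption: pairs without an annotated interaction count as negatives.
        label = frozenset((id_a, id_b)) in positives
        candidates.append((text_a, text_b, label))
    return candidates


pairs = relation_candidates(SAMPLE_SENTENCE)
for pair in pairs:
    print(pair)
```

The entity mentions alone serve as Named Entity Recognition targets, while the labeled pairs serve as Relation Extraction examples.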
@article{fundel2007relex,
  title={RelEx—Relation extraction using dependency parse trees},
  author={Fundel, Katrin and K{\"u}ffner, Robert and Zimmer, Ralf},
  journal={Bioinformatics},
  volume={23},
  number={3},
  pages={365--371},
  year={2007},
  publisher={Oxford University Press}
}