Dataset:
swedish_medical_ner
SwedMedNER is a Named Entity Recognition dataset of Swedish medical text. It consists of three subsets, each derived from a different source: the Swedish Wikipedia (a.k.a. wiki), Läkartidningen (a.k.a. lt), and 1177 Vårdguiden (a.k.a. 1177). The Swedish Wikipedia and Läkartidningen subsets together contain over 790000 sequences of 60 characters each. The 1177 Vårdguiden subset is manually annotated and contains 927 sentences with 2740 annotations, of which 1574 are disorder and finding, 546 are pharmaceutical drug, and 620 are body structure.
Texts from both Swedish Wikipedia and Läkartidningen were automatically annotated using a list of medical seed terms. Sentences from 1177 Vårdguiden were manually annotated.
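To make the idea of seed-term annotation concrete, here is a minimal sketch of dictionary-based tagging. It assumes simple case-insensitive whole-word matching with longest-term-first priority; the function name, the seed list, and the label strings are all hypothetical, and the procedure actually used to build the dataset may well differ.

```python
import re

def annotate_with_seed_terms(text, seed_terms):
    """Tag occurrences of seed terms in text.

    seed_terms maps a term to a label (both hypothetical here).
    Longer terms are matched first and overlapping matches are skipped.
    """
    spans = []
    taken = [False] * len(text)  # characters already covered by a match
    for term in sorted(seed_terms, key=len, reverse=True):
        # Whole-word, case-insensitive match of the literal term.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(text):
            if any(taken[m.start():m.end()]):
                continue
            for i in range(m.start(), m.end()):
                taken[i] = True
            spans.append({
                "start": m.start(),
                "end": m.end(),
                "text": m.group(0),
                "type": seed_terms[term],
            })
    return sorted(spans, key=lambda s: s["start"])


# Illustrative seed list, not taken from the dataset.
seeds = {"förstoppning": "disorder_finding", "cox-hämmare": "pharmaceutical_drug"}
spans = annotate_with_seed_terms(
    "Förstoppning är vanligt. Cox-hämmare finns som gel.", seeds
)
```

Matching longest terms first keeps a short seed term from splitting a longer one that contains it.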
Medical NER.
Swedish (SV).
Annotated example sentences are shown below:
( Förstoppning ) är ett vanligt problem hos äldre.
[ Cox-hämmare ] finns även som gel och sprej.
[ Medicinen ] kan också göra att man blöder lättare eftersom den påverkar { blodets } förmåga att levra sig.
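Tags are as follows:
Parentheses, ( ): disorder and finding
Brackets, [ ]: pharmaceutical drug
Curly brackets, { }: body structure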
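The bracket markup in the example sentences can be turned into character spans with a small parser. The sketch below assumes that parentheses mark disorder and finding, square brackets mark pharmaceutical drug, and curly brackets mark body structure, which is how the example sentences line up with the three entity types in the description. The function name and label strings are hypothetical, and the offsets it returns refer to the sentence with the markup stripped, which is not necessarily how the dataset itself stores them.

```python
import re

# Assumed mapping from opening delimiter to entity label.
DELIMS = {
    "(": "disorder_finding",
    "[": "pharmaceutical_drug",
    "{": "body_structure",
}

# One alternative per delimiter pair; optional padding spaces are dropped.
PATTERN = re.compile(r"\(\s*(.+?)\s*\)|\[\s*(.+?)\s*\]|\{\s*(.+?)\s*\}")

def parse_annotated(sentence):
    """Return (plain_text, entities) for a bracket-annotated sentence."""
    pieces, entities = [], []
    pos = 0       # read position in the annotated sentence
    out_len = 0   # length of the plain text built so far
    for m in PATTERN.finditer(sentence):
        before = sentence[pos:m.start()]
        pieces.append(before)
        out_len += len(before)
        text = next(g for g in m.groups() if g is not None)
        entities.append({
            "start": out_len,
            "end": out_len + len(text),
            "text": text,
            "type": DELIMS[m.group(0)[0]],
        })
        pieces.append(text)
        out_len += len(text)
        pos = m.end()
    pieces.append(sentence[pos:])
    return "".join(pieces), entities


plain, ents = parse_annotated("( Förstoppning ) är ett vanligt problem hos äldre.")
```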
Data example:
In: data = load_dataset('./datasets/swedish_medical_ner', "wiki")
In: data['train']
Out: Dataset({
    features: ['sid', 'sentence', 'entities'],
    num_rows: 48720
})
In: data['train'][0]['sentence']
Out: '{kropp} beskrivs i till exempel människokroppen, anatomi och f'
In: data['train'][0]['entities']
Out: {'start': [0], 'end': [7], 'text': ['kropp'], 'type': [2]}
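In the record shown above, the `start` and `end` offsets cover the markup characters in `sentence` (positions 0 to 7 span `{kropp}`), while `text` holds the bare term. A minimal sketch of a helper that walks the parallel lists in `entities` follows; the helper name `iter_spans` is hypothetical, and the behaviour is inferred from this single example rather than from a documented schema.

```python
def iter_spans(example):
    """Yield one dict per entity from the parallel lists in example['entities']."""
    ents = example["entities"]
    for start, end, text, type_id in zip(
        ents["start"], ents["end"], ents["text"], ents["type"]
    ):
        yield {
            "start": start,
            "end": end,
            # Slice of the stored sentence; includes the bracket markup.
            "surface": example["sentence"][start:end],
            "text": text,
            "type": type_id,
        }


# The record from the data example, copied by hand.
record = {
    "sentence": "{kropp} beskrivs i till exempel människokroppen, anatomi och f",
    "entities": {"start": [0], "end": [7], "text": ["kropp"], "type": [2]},
}
spans = list(iter_spans(record))
```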
In the original paper, the authors used the text from Läkartidningen for model training, Swedish Wikipedia for validation, and 1177.se for the final model evaluation.
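That arrangement can be reproduced by loading each subset and assigning it a role. The sketch below assumes the configuration names are `lt`, `wiki`, and `1177` (only `wiki` appears in the data example) and that each configuration exposes a single `train` split. The function name and the `loader` parameter are hypothetical, and the local path may need adjusting for your setup and `datasets` version.

```python
# Role in the paper's setup -> assumed configuration name.
ROLE_TO_SUBSET = {
    "train": "lt",         # Läkartidningen
    "validation": "wiki",  # Swedish Wikipedia
    "test": "1177",        # 1177 Vårdguiden
}

def load_paper_splits(path="./datasets/swedish_medical_ner", loader=None):
    """Load the three subsets and key them by their role in the paper."""
    if loader is None:
        # Imported lazily so the module can be used without `datasets` installed.
        from datasets import load_dataset as loader
    return {
        role: loader(path, subset)["train"]
        for role, subset in ROLE_TO_SUBSET.items()
    }
```

Passing a custom `loader` makes it possible to exercise the mapping without downloading anything.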
This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0).
@inproceedings{almgrenpavlovmogren2016bioner,
  title={Named Entity Recognition in Swedish Medical Journals with Deep Bidirectional Character-Based LSTMs},
  author={Simon Almgren and Sean Pavlov and Olof Mogren},
  booktitle={Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2016)},
  pages={1},
  year={2016}
}
Thanks to @bwang482 for adding this dataset.