Datasets:

swedish_medical_ner

Languages:

sv

Multilinguality:

monolingual

Size categories:

100K<n<1M

Language creators:

found

Source datasets:

original

Dataset Card for swedish_medical_ner

Dataset Summary

SwedMedNER is a Named Entity Recognition (NER) dataset of Swedish medical text. It consists of three subsets, each derived from a different source: the Swedish Wikipedia (a.k.a. wiki), Läkartidningen (a.k.a. lt), and 1177 Vårdguiden (a.k.a. 1177). The Swedish Wikipedia and Läkartidningen subsets together contain over 790,000 sequences of 60 characters each. The 1177 Vårdguiden subset is manually annotated and contains 927 sentences with 2,740 annotations, of which 1,574 are disorder and finding, 546 are pharmaceutical drug, and 620 are body structure.

Texts from both Swedish Wikipedia and Läkartidningen were automatically annotated using a list of medical seed terms. Sentences from 1177 Vårdguiden were manually annotated.

Supported Tasks and Leaderboards

Medical NER.

Languages

Swedish (SV).

Dataset Structure

Data Instances

Annotated example sentences are shown below:

( Förstoppning ) är ett vanligt problem hos äldre.
[ Cox-hämmare ] finns även som gel och sprej.
[ Medicinen ] kan också göra att man blöder lättare eftersom den påverkar { blodets } förmåga att levra sig.

Tags are as follows:

  • Parentheses, (): Disorder and Finding
  • Brackets, []: Pharmaceutical Drug
  • Curly brackets, {}: Body Structure
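
As a minimal sketch of how the marker convention above could be read programmatically, the following Python function pulls marked spans out of a sentence and maps each marker pair to its tag. The function name and regular expression are illustrative assumptions, not part of the dataset tooling, and the sketch would misread ordinary parentheses that happen to occur in running text.

```python
import re

# One capture group per marker pair, in the order the tags are listed above.
_MARKED_SPAN = re.compile(r"\(([^()]*)\)|\[([^\[\]]*)\]|\{([^{}]*)\}")
_LABELS = ["Disorder and Finding", "Pharmaceutical Drug", "Body Structure"]


def extract_marked_entities(sentence):
    """Return (entity_text, label) pairs for every marked span, left to right.

    Hypothetical helper: assumes markers are never nested and that any
    bracketed text is an annotation rather than ordinary punctuation.
    """
    found = []
    for match in _MARKED_SPAN.finditer(sentence):
        for group_index, label in enumerate(_LABELS, start=1):
            text = match.group(group_index)
            # Exactly one group is set per match; skip empty spans like "()".
            if text is not None and text.strip():
                found.append((text.strip(), label))
    return found
```

Applied to the third example sentence, this yields one Pharmaceutical Drug span ("Medicinen") followed by one Body Structure span ("blodets").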

Data example:

In: from datasets import load_dataset
In: data = load_dataset('./datasets/swedish_medical_ner', "wiki")
In: data['train']
Out: 
Dataset({
    features: ['sid', 'sentence', 'entities'],
    num_rows: 48720
})

In: data['train'][0]['sentence']
Out: '{kropp} beskrivs i till exempel människokroppen, anatomi och f'
In: data['train'][0]['entities']
Out: {'start': [0], 'end': [7], 'text': ['kropp'], 'type': [2]}

Data Fields

  • sid
  • sentence
  • entities
    • start : the start index
    • end : the end index
    • text : the text of the entity
    • type : entity type: Disorder and Finding (0), Pharmaceutical Drug (1) or Body Structure (2)
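
The entities field stores one list per attribute rather than one record per entity. The sketch below, with a hypothetical helper name, zips those parallel lists into per-entity records and maps the integer type to its label. It also slices the sentence with start and end; in the single data example shown above that slice covers the entity together with its markers, and treating this as a general rule is an assumption.

```python
# Integer codes as documented in the field list above.
TYPE_NAMES = {
    0: "Disorder and Finding",
    1: "Pharmaceutical Drug",
    2: "Body Structure",
}


def entity_records(example):
    """Turn the columnar 'entities' dict into a list of per-entity dicts.

    Hypothetical helper; 'marked_span' assumes start/end are character
    offsets into 'sentence' with an exclusive end, as in the data example.
    """
    entities = example["entities"]
    sentence = example["sentence"]
    columns = zip(entities["start"], entities["end"], entities["text"], entities["type"])
    return [
        {
            "start": start,
            "end": end,
            "text": text,
            "label": TYPE_NAMES[type_id],
            "marked_span": sentence[start:end],
        }
        for start, end, text, type_id in columns
    ]
```

For the data example above, this produces a single record with the label Body Structure.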

Data Splits

In the original paper, the authors used the text from Läkartidningen for model training, Swedish Wikipedia for validation, and 1177.se for the final model evaluation.
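
A rough sketch of reproducing that arrangement is given below. The configuration names "lt" and "1177" are assumptions taken from the abbreviations in the summary (only "wiki" appears in the data example), and the loader is passed in as a callable so that the mapping itself does not depend on how the data is fetched.

```python
def build_paper_style_splits(load_config):
    """Map the three subsets to train/validation/test as described above.

    load_config is any callable that takes a configuration name and returns
    that subset. The names "lt" and "1177" are assumed, not confirmed.
    """
    return {
        "train": load_config("lt"),         # Läkartidningen
        "validation": load_config("wiki"),  # Swedish Wikipedia
        "test": load_config("1177"),        # 1177 Vårdguiden
    }


# Possible usage with the Hugging Face datasets library, assuming each
# configuration exposes a single 'train' split and the identifier below
# resolves in your environment:
#
#   from datasets import load_dataset
#   splits = build_paper_style_splits(
#       lambda name: load_dataset("swedish_medical_ner", name)["train"]
#   )
```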

Dataset Creation

Curation Rationale

Source Data

  • Swedish Wikipedia;
  • Läkartidningen - contains articles from the Swedish journal for medical professionals;
  • 1177.se - a web site provided by the Swedish public health care authorities, containing information, counselling, and other health-care services.

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process
  • A list of seed terms was extracted using SweMeSH and SNOMED CT;
    • The following predefined categories were used for the extraction: disorder & finding (sjukdom & symtom), pharmaceutical drug (läkemedel), and body structure (kroppsdel);
  • For Swedish Wikipedia, an initial list of medical domain articles was selected manually. These source articles, as well as the articles they link to, were downloaded and automatically annotated by finding the aforementioned seed terms with a context window of 60 characters;
  • Articles from the Läkartidningen corpus were downloaded and automatically annotated by finding the aforementioned seed terms with a context window of 60 characters;
  • 15 documents from 1177.se were downloaded in May 2016 and then manually annotated with the seed terms as support, resulting in 2,740 annotations.
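
The automatic annotation step can be illustrated with a small sketch. The card does not specify how the 60-character window is positioned or how terms are matched, so everything below (whole-word, case-insensitive matching; a window roughly centred on the term and clipped at the text boundaries; the function and category names) is an assumption made for illustration, not the authors' procedure.

```python
import re


def annotate_with_seed_terms(text, seed_terms, window=60):
    """Find seed terms in text and return a context window around each hit.

    seed_terms maps a term to its category name. Hypothetical sketch: the
    window placement and matching rules are assumptions.
    """
    results = []
    for term, category in seed_terms.items():
        # Whole-word match so that "kropp" does not fire inside "människokroppen".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            slack = max(window - len(term), 0)
            start = max(match.start() - slack // 2, 0)
            end = min(start + window, len(text))
            results.append(
                {
                    "context": text[start:end],
                    "term": match.group(0),
                    "category": category,
                    "offset": match.start() - start,  # term position inside context
                }
            )
    return results
```

Each returned context is at most `window` characters long, and `offset` locates the matched term within it.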

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

  • Simon Almgren, simonwalmgren@gmail.com
  • Sean Pavlov, sean.pavlov@gmail.com
  • Olof Mogren, olof@mogren.one Chalmers University of Technology

Licensing Information

This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0).

Citation Information

@inproceedings{almgrenpavlovmogren2016bioner,
  title={Named Entity Recognition in Swedish Medical Journals with Deep Bidirectional Character-Based LSTMs},
  author={Almgren, Simon and Pavlov, Sean and Mogren, Olof},
  booktitle={Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2016)},
  pages={1},
  year={2016}
}

Contributions

Thanks to @bwang482 for adding this dataset.