Datasets:

norne

Languages:

no

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

crowdsourced

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:1911.12146

License:

other

Dataset Card for NorNE: Norwegian Named Entities

Dataset Summary

NorNE is a manually annotated corpus of named entities which extends the annotation of the existing Norwegian Dependency Treebank. Covering both official standards of written Norwegian (Bokmål and Nynorsk), the corpus contains around 600,000 tokens and annotates a rich set of entity types including persons, organizations, locations, geo-political entities, products, and events, in addition to a class corresponding to nominals derived from names.

There are 3 main configs in this dataset, each with 3 versions of the NER tag set. When accessing the bokmaal, nynorsk, or combined configs, the NER tag set comprises 9 tags: GPE_ORG, GPE_LOC, ORG, LOC, PER, PROD, EVT, DRV, and MISC. The two special types GPE_LOC and GPE_ORG can easily be altered depending on the task, choosing either the more general GPE tag or the more specific LOC/ORG tags, conflating them with the other annotations of the same type. To access these reduced versions of the dataset, use the configs bokmaal-7, nynorsk-7, and combined-7 for the NER tag set with 7 tags (ORG, LOC, PER, PROD, EVT, DRV, MISC), and bokmaal-8, nynorsk-8, and combined-8 for the NER tag set with 8 tags (ORG, LOC, GPE, PER, PROD, EVT, DRV, MISC). By default, the full set (9 tags) is used. See Annotations for further details.
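The config names follow a simple pattern: a language variety, optionally suffixed with the tag-set size. A minimal sketch of that naming rule follows. The helper names `norne_config` and `load_norne` are hypothetical, and the hub identifier passed to `load_dataset` is an assumption taken from the dataset name in this card, so it may need an organization prefix.

```python
VARIETIES = ("bokmaal", "nynorsk", "combined")
TAG_SET_SIZES = (7, 8, 9)


def norne_config(variety: str, n_tags: int = 9) -> str:
    """Return the config name for a language variety and NER tag-set size."""
    if variety not in VARIETIES:
        raise ValueError(f"unknown variety: {variety!r}")
    if n_tags not in TAG_SET_SIZES:
        raise ValueError(f"unsupported tag-set size: {n_tags}")
    # The full 9-tag set is the default and carries no suffix.
    return variety if n_tags == 9 else f"{variety}-{n_tags}"


def load_norne(variety: str, n_tags: int = 9, dataset_id: str = "norne"):
    """Load one config with the Hugging Face `datasets` library.

    `dataset_id` is an assumption, not confirmed by this card. The import is
    local so that `norne_config` works without `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset(dataset_id, norne_config(variety, n_tags))
```

For example, `norne_config("nynorsk", 7)` yields the name of the 7-tag Nynorsk config, while `norne_config("bokmaal")` yields the default 9-tag Bokmål config.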

Supported Tasks and Leaderboards

NorNE adds named entity annotations on top of the Norwegian Dependency Treebank.

Languages

Both Norwegian Bokmål (bokmaal) and Nynorsk (nynorsk) are supported as different configs in this dataset. An extra config for the combined languages is also included (combined). See the Annotations section for details on accessing reduced tag sets for the NER feature.

Dataset Structure

Each entry contains the text of one sentence, its language variety, an identifier, its tokens and lemmas, and the corresponding NER and POS tag lists.

Data Instances

An example from the train split of the bokmaal config:

{'idx': '000001',
 'lang': 'bokmaal',
 'lemmas': ['lam', 'og', 'piggvar', 'på', 'bryllupsmeny'],
 'ner_tags': [0, 0, 0, 0, 0],
 'pos_tags': [0, 9, 0, 5, 0],
 'text': 'Lam og piggvar på bryllupsmenyen',
 'tokens': ['Lam', 'og', 'piggvar', 'på', 'bryllupsmenyen']}

Data Fields

Each entry is annotated with the following fields:

  • idx (str), sentence identifier from the NorNE dataset
  • lang (str), language variety, either bokmaal, nynorsk or combined
  • text (str), plain text
  • tokens (List[str]), list of tokens extracted from text
  • lemmas (List[str]), list of lemmas extracted from tokens
  • ner_tags (List[int]), list of numeric NER tags for each token in tokens
  • pos_tags (List[int]), list of numeric PoS tags for each token in tokens
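The four list-valued fields are parallel: position i in each list describes the same token. A small sketch that checks this and zips the fields into per-token rows, using the instance shown in this card (the helper name `token_rows` is illustrative, not part of any library):

```python
from typing import Any, Dict, List, Tuple

# The instance shown in this card (train split, bokmaal config).
example: Dict[str, Any] = {
    "idx": "000001",
    "lang": "bokmaal",
    "lemmas": ["lam", "og", "piggvar", "på", "bryllupsmeny"],
    "ner_tags": [0, 0, 0, 0, 0],
    "pos_tags": [0, 9, 0, 5, 0],
    "text": "Lam og piggvar på bryllupsmenyen",
    "tokens": ["Lam", "og", "piggvar", "på", "bryllupsmenyen"],
}

PER_TOKEN_FIELDS = ("tokens", "lemmas", "ner_tags", "pos_tags")


def token_rows(entry: Dict[str, Any]) -> List[Tuple[Any, ...]]:
    """Zip the per-token fields into (token, lemma, ner_tag, pos_tag) rows."""
    lengths = {len(entry[name]) for name in PER_TOKEN_FIELDS}
    if len(lengths) != 1:
        raise ValueError(f"per-token fields differ in length: {sorted(lengths)}")
    return list(zip(*(entry[name] for name in PER_TOKEN_FIELDS)))


rows = token_rows(example)
for token, lemma, ner_tag, pos_tag in rows:
    print(token, lemma, ner_tag, pos_tag)
```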

An example DataFrame obtained from the dataset:

idx lang text tokens lemmas ner_tags pos_tags
0 000001 bokmaal Lam og piggvar på bryllupsmenyen [Lam, og, piggvar, på, bryllupsmenyen] [lam, og, piggvar, på, bryllupsmeny] [0, 0, 0, 0, 0] [0, 9, 0, 5, 0]
1 000002 bokmaal Kamskjell, piggvar og lammefilet sto på menyen... [Kamskjell, ,, piggvar, og, lammefilet, sto, p... [kamskjell, $,, piggvar, og, lammefilet, stå, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] [0, 1, 0, 9, 0, 15, 2, 0, 2, 8, 6, 0, 1]
2 000003 bokmaal Og til dessert: Parfait à la Mette-Marit. [Og, til, dessert, :, Parfait, à, la, Mette-Ma... [og, til, dessert, $:, Parfait, à, la, Mette-M... [0, 0, 0, 0, 7, 8, 8, 8, 0] [9, 2, 0, 1, 10, 12, 12, 10, 1]

Data Splits

There are three splits: train, validation and test.

Config Split Total
bokmaal train 15696
bokmaal validation 2410
bokmaal test 1939
nynorsk train 14174
nynorsk validation 1890
nynorsk test 1511
combined train 29870
combined validation 4300
combined test 3450
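Each combined figure equals the sum of the Bokmål and Nynorsk figures for the matching split, which suggests that the combined config simply concatenates the two varieties (an inference from the numbers, not something the card states). A quick check, with the sizes transcribed by hand from the table:

```python
# Sentence counts per config and split, transcribed from the table.
SPLIT_SIZES = {
    "bokmaal": {"train": 15696, "validation": 2410, "test": 1939},
    "nynorsk": {"train": 14174, "validation": 1890, "test": 1511},
    "combined": {"train": 29870, "validation": 4300, "test": 3450},
}


def summed_varieties(split: str) -> int:
    """Add up the Bokmål and Nynorsk sizes for one split."""
    return SPLIT_SIZES["bokmaal"][split] + SPLIT_SIZES["nynorsk"][split]


for split in ("train", "validation", "test"):
    matches = summed_varieties(split) == SPLIT_SIZES["combined"][split]
    print(split, summed_varieties(split), matches)
```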

Dataset Creation

Curation Rationale

  • A name in this context is close to Saul Kripke's definition of a name, in that a name has a unique reference and its meaning is constant (there are exceptions in the annotations, e.g. "Regjeringen" (en. "Government")).
  • It is the usage of a name that determines the entity type, not the default/literal sense of the name.
  • If there is an ambiguity in the type/sense of a name, then the default/literal sense of the name is chosen (following Markert and Nissim, 2002).
  • For more details, see the "Annotation Guidelines.pdf" distributed with the corpus.

    Source Data

    Data was collected from blogs and newspapers in Norwegian, as well as parliament speeches and governmental reports.

    Initial Data Collection and Normalization

    The texts in the Norwegian Dependency Treebank (NDT) are manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar.

    The treebank consists of two parts, one in Norwegian Bokmål (nob) and one in Norwegian Nynorsk (nno). Each part contains around 300,000 tokens and is a mix of different non-fictional genres.

    See the NDT webpage for more details.

    Annotations

    The following types of entities are annotated:

    • Person (PER): Real or fictional characters and animals
    • Organization (ORG): Any collection of people, such as firms, institutions, organizations, music groups, sports teams, unions, political parties etc.
    • Location (LOC): Geographical places, buildings and facilities
    • Geo-political entity (GPE): Geographical regions defined by political and/or social groups. A GPE entity subsumes and does not distinguish between a nation, its region, its government, or its people
    • Product (PROD): Artificially produced entities are regarded as products. This may include more abstract entities, such as speeches, radio shows, programming languages, contracts, laws and ideas.
    • Event (EVT): Festivals, cultural events, sports events, weather phenomena, wars, etc. Events are bounded in time and space.
    • Derived (DRV): Words (and possibly phrases) that are derived from a name, but are not names in themselves. They typically contain a full name and are capitalized, but are not proper nouns. Fictitious examples are "Brann-treneren" ("the Brann coach") or "Oslo-mannen" ("the man from Oslo").
    • Miscellaneous (MISC): Names that do not belong in the other categories. Examples are animal species and names of medical conditions. Entities that are manufactured or produced are of type Product, whereas things that occur naturally or spontaneously are of type Miscellaneous.

    Furthermore, all GPE entities are additionally sub-categorized as being either ORG or LOC, with the two annotation levels separated by an underscore:

    • GPE_LOC: Geo-political entity, with a locative sense (e.g. "John lives in Spain")
    • GPE_ORG: Geo-political entity, with an organisation sense (e.g. "Spain declined to meet with Belgium")

    The two special types GPE_LOC and GPE_ORG can easily be altered depending on the task, choosing either the more general GPE tag or the more specific LOC/ORG tags, conflating them with the other annotations of the same type. This means that the following sets of entity types can be derived:

    • 7 types, dropping the GPE level so that GPE_LOC becomes LOC and GPE_ORG becomes ORG: ORG, LOC, PER, PROD, EVT, DRV, MISC
    • 8 types, dropping the LOC/ORG level so that GPE_LOC and GPE_ORG both become GPE: ORG, LOC, GPE, PER, PROD, EVT, DRV, MISC
    • 9 types, keeping all types: ORG, LOC, GPE_LOC, GPE_ORG, PER, PROD, EVT, DRV, MISC
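The derivation of the 7- and 8-type sets can be sketched as a label mapping. This is an illustration only, not the code used to build the configs: the function name `reduce_tag` is hypothetical, and the handling of a `B-`/`I-` prefix assumes BIO-style string labels, which this card does not specify.

```python
FULL_TYPES = ["ORG", "LOC", "GPE_LOC", "GPE_ORG", "PER", "PROD", "EVT", "DRV", "MISC"]


def reduce_tag(label: str, n_types: int = 9) -> str:
    """Map a label from the full 9-type set to the 7- or 8-type set.

    Accepts bare types ("GPE_LOC") and, as an assumption, BIO-prefixed
    labels ("B-GPE_LOC"); any prefix is kept unchanged.
    """
    if n_types not in (7, 8, 9):
        raise ValueError(f"unsupported tag-set size: {n_types}")
    # Split off an optional "B-"/"I-" prefix; "O" and bare types have none.
    prefix, sep, entity = label.rpartition("-")
    if entity in ("GPE_LOC", "GPE_ORG"):
        if n_types == 8:
            entity = "GPE"  # keep only the general GPE level
        elif n_types == 7:
            entity = entity.split("_")[1]  # keep only LOC or ORG
    return f"{prefix}{sep}{entity}"


seven = sorted({reduce_tag(t, 7) for t in FULL_TYPES})
eight = sorted({reduce_tag(t, 8) for t in FULL_TYPES})
print(len(seven), seven)
print(len(eight), eight)
```

Applying the mapping to every type in the full set leaves exactly 7 and 8 distinct types respectively, matching the lists above.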

    The class distribution is as follows, broken down across the data splits of the UD version of NDT, and sorted by total counts (i.e. the number of examples, not tokens within the spans of the annotations):

    Type Train Dev Test Total
    PER 4033 607 560 5200
    ORG 2828 400 283 3511
    GPE_LOC 2132 258 257 2647
    PROD 671 162 71 904
    LOC 613 109 103 825
    DRV 519 77 48 644
    GPE_ORG 388 55 50 493
    EVT 131 9 5 145
    MISC 8 0 0 8
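The Total column is the row sum of the three splits, so it can be recomputed and re-sorted mechanically. A small sketch with the per-split counts transcribed from the table (variable names are illustrative):

```python
# (train, dev, test) counts per entity type, transcribed from the table.
COUNTS = {
    "PER": (4033, 607, 560),
    "ORG": (2828, 400, 283),
    "GPE_LOC": (2132, 258, 257),
    "PROD": (671, 162, 71),
    "LOC": (613, 109, 103),
    "GPE_ORG": (388, 55, 50),
    "DRV": (519, 77, 48),
    "EVT": (131, 9, 5),
    "MISC": (8, 0, 0),
}

# Row sums, then types ordered from most to least frequent.
totals = {entity: sum(per_split) for entity, per_split in COUNTS.items()}
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
grand_total = sum(totals.values())

for entity, total in ranked:
    print(f"{entity:8s} {total:5d} {total / grand_total:6.1%}")
```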

    To access these reduced versions of the dataset, use the configs bokmaal-7, nynorsk-7, and combined-7 for the NER tag set with 7 tags (ORG, LOC, PER, PROD, EVT, DRV, MISC), and bokmaal-8, nynorsk-8, and combined-8 for the NER tag set with 8 tags (ORG, LOC, GPE, PER, PROD, EVT, DRV, MISC). By default, the full set (9 tags) is used.

    Additional Information

    Dataset Curators

    NorNE was created as a collaboration between Schibsted Media Group, Språkbanken at the National Library of Norway and the Language Technology Group at the University of Oslo.

    NorNE was added to Hugging Face Datasets by the AI-Lab at the National Library of Norway.

    Licensing Information

    The NorNE corpus is published under the same license as the Norwegian Dependency Treebank.

    Citation Information

    This dataset is described in the paper NorNE: Annotating Named Entities for Norwegian by Fredrik Jørgensen, Tobias Aasmoe, Anne-Stine Ruud Husevåg, Lilja Øvrelid, and Erik Velldal, accepted for LREC 2020 and available as a pre-print at https://arxiv.org/abs/1911.12146.

    @inproceedings{johansen2019ner,
      title={NorNE: Annotating Named Entities for Norwegian},
      author={Fredrik Jørgensen and Tobias Aasmoe and Anne-Stine Ruud Husevåg and
              Lilja Øvrelid and Erik Velldal},
      booktitle={LREC 2020},
      year={2020},
      url={https://arxiv.org/abs/1911.12146}
    }
    

    Contributions

    Thanks to @versae for adding this dataset.