Dataset:

NbAiLab/norne

Tasks:

Token classification

Sub-tasks:

named-entity-recognition part-of-speech

Languages:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

crowdsourced

Annotation creators:

expert-generated

Source datasets:

original

Preprint:

arxiv:1911.12146

Other:

structure-prediction

License:

other


Dataset Card for NorNE: Norwegian Named Entities

Dataset Summary

NorNE is a manually annotated corpus of named entities which extends the annotation of the existing Norwegian Dependency Treebank. Covering both official written standards of Norwegian (Bokmål and Nynorsk), the corpus contains around 600,000 tokens and annotates a rich set of entity types including persons, organizations, locations, geo-political entities, products, and events, in addition to a class corresponding to nominals derived from names.

Supported Tasks and Leaderboards

NorNE adds named entity annotations on top of the Norwegian Dependency Treebank.

Languages

Both Norwegian Bokmål (bokmaal) and Nynorsk (nynorsk) are supported as separate configs in this dataset. An extra config that combines the two varieties is also included (combined). See the Annotations section for details on accessing reduced tag sets for the NER feature.

Dataset Structure

Each entry contains one sentence as plain text, together with its language variety, its identifier, its tokens and lemmas, and the corresponding lists of NER and POS tags.

Data Instances

An example from the train split of the bokmaal config:

{'idx': '000001',
 'lang': 'bokmaal',
 'lemmas': ['lam', 'og', 'piggvar', 'på', 'bryllupsmeny'],
 'ner_tags': [0, 0, 0, 0, 0],
 'pos_tags': [0, 9, 0, 5, 0],
 'text': 'Lam og piggvar på bryllupsmenyen',
 'tokens': ['Lam', 'og', 'piggvar', 'på', 'bryllupsmenyen']}

Data Fields

Each entry is annotated with the following fields:

idx (str), sentence identifier from the NorNE dataset
lang (str), language variety, either bokmaal, nynorsk or combined
text (str), plain text
tokens (List[str]), list of tokens extracted from text
lemmas (List[str]), list of lemmas extracted from tokens
ner_tags (List[int]), list of numeric NER tags for each token in tokens
pos_tags (List[int]), list of numeric PoS tags for each token in tokens
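
The field descriptions imply that tokens, lemmas, ner_tags and pos_tags are parallel lists with one element per token. A minimal sketch of that check, using the instance shown under Data Instances (the helper name is_aligned is hypothetical and not part of the dataset or any library):

```python
from typing import Any, Dict

# The instance shown under "Data Instances", copied verbatim.
example: Dict[str, Any] = {
    "idx": "000001",
    "lang": "bokmaal",
    "lemmas": ["lam", "og", "piggvar", "på", "bryllupsmeny"],
    "ner_tags": [0, 0, 0, 0, 0],
    "pos_tags": [0, 9, 0, 5, 0],
    "text": "Lam og piggvar på bryllupsmenyen",
    "tokens": ["Lam", "og", "piggvar", "på", "bryllupsmenyen"],
}

# Fields that hold one value per token.
TOKEN_LEVEL_FIELDS = ("tokens", "lemmas", "ner_tags", "pos_tags")


def is_aligned(entry: Dict[str, Any]) -> bool:
    """Return True when all token-level lists have the same length."""
    lengths = {len(entry[field]) for field in TOKEN_LEVEL_FIELDS}
    return len(lengths) == 1


print(is_aligned(example))  # → True
```

A check like this is cheap to run over a whole split before training a tagger, since a length mismatch would silently shift labels.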

An example DataFrame obtained from the dataset:

idx	lang	text	tokens	lemmas	ner_tags	pos_tags
0	000001	bokmaal	Lam og piggvar på bryllupsmenyen	[Lam, og, piggvar, på, bryllupsmenyen]	[lam, og, piggvar, på, bryllupsmeny]	[0, 0, 0, 0, 0]	[0, 9, 0, 5, 0]
1	000002	bokmaal	Kamskjell, piggvar og lammefilet sto på menyen...	[Kamskjell, ,, piggvar, og, lammefilet, sto, p...	[kamskjell, $,, piggvar, og, lammefilet, stå, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]	[0, 1, 0, 9, 0, 15, 2, 0, 2, 8, 6, 0, 1]
2	000003	bokmaal	Og til dessert: Parfait à la Mette-Marit.	[Og, til, dessert, :, Parfait, à, la, Mette-Ma...	[og, til, dessert, $:, Parfait, à, la, Mette-M...	[0, 0, 0, 0, 7, 8, 8, 8, 0]	[9, 2, 0, 1, 10, 12, 12, 10, 1]

Data Splits

There are three splits: train, validation and test.

Config	Split	Total
bokmaal	train	15696
bokmaal	validation	2410
bokmaal	test	1939
nynorsk	train	14174
nynorsk	validation	1890
nynorsk	test	1511
combined	train	29870
combined	validation	4300
combined	test	3450
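
The counts in the table are consistent with the combined config being the concatenation of the Bokmål and Nynorsk splits. A short arithmetic check, with the numbers copied from the table (the function name combined_sizes is hypothetical):

```python
from typing import Dict

# Sentence counts per config and split, copied from the table above.
SPLIT_SIZES: Dict[str, Dict[str, int]] = {
    "bokmaal": {"train": 15696, "validation": 2410, "test": 1939},
    "nynorsk": {"train": 14174, "validation": 1890, "test": 1511},
}


def combined_sizes(sizes: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum the per-split counts over all language configs."""
    totals: Dict[str, int] = {}
    for splits in sizes.values():
        for split, count in splits.items():
            totals[split] = totals.get(split, 0) + count
    return totals


print(combined_sizes(SPLIT_SIZES))
# → {'train': 29870, 'validation': 4300, 'test': 3450}
```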

Dataset Creation

Curation Rationale

A name in this context is close to Saul Kripke's definition of a name, in that a name has a unique reference and its meaning is constant (there are exceptions in the annotations, e.g. "Regjeringen" (en. "Government")).

It is the usage of a name that determines the entity type, not the default/literal sense of the name.

If there is an ambiguity in the type/sense of a name, then the default/literal sense of the name is chosen (following Markert and Nissim, 2002).

For more details, see the "Annotation Guidelines.pdf" distributed with the corpus.

Source Data

The data was collected from blogs and newspapers in Norwegian, as well as parliamentary speeches and governmental reports.

Initial Data Collection and Normalization

The texts in the Norwegian Dependency Treebank (NDT) are manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar.

The treebank consists of two parts, one in Norwegian Bokmål (nob) and one in Norwegian Nynorsk (nno). Each part contains around 300,000 tokens and is a mix of different non-fictional genres.

See the NDT webpage for more details.

Annotations

The following types of entities are annotated:

Person (PER): Real or fictional characters and animals
Organization (ORG): Any collection of people, such as firms, institutions, organizations, music groups, sports teams, unions, political parties etc.
Location (LOC): Geographical places, buildings and facilities
Geo-political entity (GPE): Geographical regions defined by political and/or social groups. A GPE entity subsumes and does not distinguish between a nation, its region, its government, or its people
Product (PROD): Artificially produced entities are regarded as products. This may include more abstract entities, such as speeches, radio shows, programming languages, contracts, laws and ideas.
Event (EVT): Festivals, cultural events, sports events, weather phenomena, wars, etc. Events are bounded in time and space.
Derived (DRV): Words (and phrases?) that are derived from a name, but are not names in themselves. They typically contain a full name and are capitalized, but are not proper nouns. Examples (fictive) are "Brann-treneren" ("the Brann coach") or "Oslo-mannen" ("the man from Oslo").
Miscellaneous (MISC): Names that do not belong in the other categories. Examples are animal species and names of medical conditions. Entities that are manufactured or produced are of type Product, whereas things that occur naturally or spontaneously are of type Miscellaneous.

Furthermore, all GPE entities are additionally sub-categorized as being either ORG or LOC, with the two annotation levels separated by an underscore:

GPE_LOC: Geo-political entity, with a locative sense (e.g. "John lives in Spain")
GPE_ORG: Geo-political entity, with an organisation sense (e.g. "Spain declined to meet with Belgium")

The two special types GPE_LOC and GPE_ORG can easily be altered depending on the task, choosing either the more general GPE tag or the more specific LOC/ORG tags, conflating them with the other annotations of the same type. This means that the following sets of entity types can be derived:

7 types, deleting the GPE_ prefix: ORG, LOC, PER, PROD, EVT, DRV, MISC
8 types, deleting the _LOC and _ORG suffixes: ORG, LOC, GPE, PER, PROD, EVT, DRV, MISC
9 types, keeping all types: ORG, LOC, GPE_LOC, GPE_ORG, PER, PROD, EVT, DRV, MISC
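
Because the two special tags are written GPE_LOC and GPE_ORG, the 8-type set amounts to collapsing both into GPE, and the 7-type set amounts to dropping the GPE_ prefix so that they merge with LOC and ORG. A minimal sketch of that mapping on string labels follows; the function name reduce_tag is hypothetical, and the handling of "B-"/"I-" prefixes is an assumption, since the label scheme is not spelled out here.

```python
# The nine entity types listed above.
NINE_TYPES = ["ORG", "LOC", "GPE_LOC", "GPE_ORG", "PER", "PROD", "EVT", "DRV", "MISC"]


def reduce_tag(tag: str, n_types: int = 9) -> str:
    """Map a 9-type label to the 8-type or 7-type tag set.

    Works on bare types ("GPE_LOC") and, as an assumption, on
    prefixed labels such as "B-GPE_LOC", since only the type
    substring is rewritten.
    """
    if n_types == 9:
        return tag
    if n_types == 8:
        # Keep the general GPE tag, drop the LOC/ORG sub-category.
        return tag.replace("GPE_LOC", "GPE").replace("GPE_ORG", "GPE")
    if n_types == 7:
        # Keep the specific LOC/ORG tag, drop the GPE marker.
        return tag.replace("GPE_", "")
    raise ValueError(f"unsupported tag set size: {n_types}")


for n in (9, 8, 7):
    reduced = sorted({reduce_tag(t, n) for t in NINE_TYPES})
    print(n, len(reduced), reduced)
```

Each line of output lists the requested size, the number of distinct types after reduction, and the types themselves, so the three sizes can be compared against the lists above.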

The class distribution is as follows, broken down across the data splits of the UD version of NDT, and sorted by total counts (i.e. the number of examples, not tokens within the spans of the annotations):

Type	Train	Dev	Test	Total
PER	4033	607	560	5200
ORG	2828	400	283	3511
GPE_LOC	2132	258	257	2647
PROD	671	162	71	904
LOC	613	109	103	825
DRV	519	77	48	644
GPE_ORG	388	55	50	493
EVT	131	9	5	145
MISC	8	0	0	8
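
The Total column can be recomputed from the per-split columns, which also gives each type's share of all annotated entities. A small sketch with the Train/Dev/Test counts copied from the table (the variable names are illustrative only):

```python
# (train, dev, test) entity counts per type, copied from the table above.
COUNTS = {
    "PER": (4033, 607, 560),
    "ORG": (2828, 400, 283),
    "GPE_LOC": (2132, 258, 257),
    "PROD": (671, 162, 71),
    "LOC": (613, 109, 103),
    "GPE_ORG": (388, 55, 50),
    "DRV": (519, 77, 48),
    "EVT": (131, 9, 5),
    "MISC": (8, 0, 0),
}

# Row totals and the overall number of annotated entities.
totals = {entity_type: sum(splits) for entity_type, splits in COUNTS.items()}
grand_total = sum(totals.values())

# Print types from most to least frequent with their relative share.
for entity_type, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{entity_type:8s}{total:6d}{total / grand_total:8.1%}")
```

The distribution is heavily skewed towards PER and ORG, which is worth keeping in mind when reading per-class evaluation scores for the rarer types.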

To access these reduced versions of the dataset, use the configs bokmaal-7, nynorsk-7 and combined-7 for the NER tag set with 7 tags (ORG, LOC, PER, PROD, EVT, DRV, MISC), and bokmaal-8, nynorsk-8 and combined-8 for the NER tag set with 8 tags (ORG, LOC, GPE, PER, PROD, EVT, DRV, MISC). By default, the full set (9 tags) is used.
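
The config naming scheme described above can be captured in a small helper. The sketch below assumes the dataset is loaded with the load_dataset function of the Hugging Face datasets library under the ID NbAiLab/norne; the helper names config_name and load_norne are hypothetical, and depending on the installed library version the loading call may need additional arguments.

```python
LANGUAGES = ("bokmaal", "nynorsk", "combined")


def config_name(language: str, n_tags: int = 9) -> str:
    """Build the config name for a language variety and tag set size."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language config: {language}")
    if n_tags == 9:
        # The full tag set is the default and carries no suffix.
        return language
    if n_tags in (7, 8):
        return f"{language}-{n_tags}"
    raise ValueError(f"unsupported tag set size: {n_tags}")


def load_norne(language: str = "bokmaal", n_tags: int = 9):
    """Load one NorNE config (requires the datasets library and network access)."""
    # Imported lazily so that config_name stays usable without the library.
    from datasets import load_dataset

    return load_dataset("NbAiLab/norne", config_name(language, n_tags))


print(config_name("nynorsk", 7))  # → nynorsk-7
```

Calling load_norne("combined", 8) would then request the combined-8 config, provided the assumptions above hold.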

Additional Information

Dataset Curators

NorNE was created as a collaboration between Schibsted Media Group, Språkbanken at the National Library of Norway and the Language Technology Group at the University of Oslo.

NorNE was added to Hugging Face Datasets by the AI-Lab at the National Library of Norway.

Licensing Information

The NorNE corpus is published under the same license as the Norwegian Dependency Treebank.

Citation Information

This dataset is described in the paper "NorNE: Annotating Named Entities for Norwegian" by Fredrik Jørgensen, Tobias Aasmoe, Anne-Stine Ruud Husevåg, Lilja Øvrelid, and Erik Velldal, accepted for LREC 2020 and available as a pre-print at https://arxiv.org/abs/1911.12146.

Author:

NbAiLab

Dataset size:

76.26 KB