Dataset: NbAiLab/norne
Tasks: Token Classification
Languages: no
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotations creators: expert-generated
Source datasets: original
ArXiv: arxiv:1911.12146
License: other

NorNE is a manually annotated corpus of named entities which extends the annotation of the existing Norwegian Dependency Treebank. Covering both official standards of written Norwegian (Bokmål and Nynorsk), the corpus contains around 600,000 tokens and annotates a rich set of entity types including persons, organizations, locations, geo-political entities, products, and events, in addition to a class corresponding to nominals derived from names.
NorNE adds named entity annotations on top of the Norwegian Dependency Treebank.
Both Norwegian Bokmål (bokmaal) and Nynorsk (nynorsk) are supported as different configs in this dataset. An extra config for the combined languages is also included (combined). See the Annotation section for details on accessing reduced tag sets for the NER feature.
Each entry contains text sentences, their language, identifiers, tokens, lemmas, and corresponding NER and POS tag lists.
An example from the train split of the bokmaal config:
{'idx': '000001', 'lang': 'bokmaal', 'lemmas': ['lam', 'og', 'piggvar', 'på', 'bryllupsmeny'], 'ner_tags': [0, 0, 0, 0, 0], 'pos_tags': [0, 9, 0, 5, 0], 'text': 'Lam og piggvar på bryllupsmenyen', 'tokens': ['Lam', 'og', 'piggvar', 'på', 'bryllupsmenyen']}
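As a minimal sketch of how the parallel lists in an entry line up, the snippet below zips the tokens, lemmas, and tag ids of the example above into one row per token. The helper name `token_rows` is hypothetical and not part of the dataset. The tag values are left as the integer ids shown in the example, since the id-to-label mapping is not given here.

```python
# The example entry from the train split of the bokmaal config, copied from above.
entry = {
    "idx": "000001",
    "lang": "bokmaal",
    "lemmas": ["lam", "og", "piggvar", "på", "bryllupsmeny"],
    "ner_tags": [0, 0, 0, 0, 0],
    "pos_tags": [0, 9, 0, 5, 0],
    "text": "Lam og piggvar på bryllupsmenyen",
    "tokens": ["Lam", "og", "piggvar", "på", "bryllupsmenyen"],
}


def token_rows(entry):
    """Return one (token, lemma, ner_tag, pos_tag) tuple per token.

    Hypothetical helper: raises ValueError if the per-token lists differ in length.
    """
    columns = [entry["tokens"], entry["lemmas"], entry["ner_tags"], entry["pos_tags"]]
    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        raise ValueError(f"misaligned columns in entry {entry['idx']}: {sorted(lengths)}")
    return list(zip(*columns))


rows = token_rows(entry)
for token, lemma, ner_id, pos_id in rows:
    print(f"{token:<15} {lemma:<13} ner={ner_id} pos={pos_id}")
```

Checking that all per-token lists have the same length before zipping avoids silently dropping trailing tokens, which `zip` would otherwise do.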
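Each entry is annotated with the following fields: idx (sentence identifier), lang (language of the sentence), text (the raw sentence), and tokens, lemmas, ner_tags, and pos_tags (lists with one item per token).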
An example DataFrame obtained from the dataset:
(row) | idx | lang | text | tokens | lemmas | ner_tags | pos_tags |
---|---|---|---|---|---|---|---|
0 | 000001 | bokmaal | Lam og piggvar på bryllupsmenyen | [Lam, og, piggvar, på, bryllupsmenyen] | [lam, og, piggvar, på, bryllupsmeny] | [0, 0, 0, 0, 0] | [0, 9, 0, 5, 0] |
1 | 000002 | bokmaal | Kamskjell, piggvar og lammefilet sto på menyen... | [Kamskjell, ,, piggvar, og, lammefilet, sto, p... | [kamskjell, $,, piggvar, og, lammefilet, stå, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 1, 0, 9, 0, 15, 2, 0, 2, 8, 6, 0, 1] |
2 | 000003 | bokmaal | Og til dessert: Parfait à la Mette-Marit. | [Og, til, dessert, :, Parfait, à, la, Mette-Ma... | [og, til, dessert, $:, Parfait, à, la, Mette-M... | [0, 0, 0, 0, 7, 8, 8, 8, 0] | [9, 2, 0, 1, 10, 12, 12, 10, 1] |
There are three splits: train, validation and test.
Config | Split | Total |
---|---|---|
bokmaal | train | 15696 |
bokmaal | validation | 2410 |
bokmaal | test | 1939 |
nynorsk | train | 14174 |
nynorsk | validation | 1890 |
nynorsk | test | 1511 |
combined | train | 29870 |
combined | validation | 4300 |
combined | test | 3450 |
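The combined figures can be cross-checked against the two language-specific configs. The sketch below assumes that the combined config is simply the concatenation of the bokmaal and nynorsk splits. The numbers in the table are consistent with that: 29870 is the sum of the two train splits, 4300 of the two validation splits, and 3450 of the two test splits.

```python
# Split sizes copied from the table above.
split_sizes = {
    "bokmaal": {"train": 15696, "validation": 2410, "test": 1939},
    "nynorsk": {"train": 14174, "validation": 1890, "test": 1511},
}

# Assumption: the combined config concatenates the two language-specific configs.
combined = {
    split: sum(sizes[split] for sizes in split_sizes.values())
    for split in ("train", "validation", "test")
}

for split, total in combined.items():
    print(f"combined {split:<10} {total}")
```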
For more details, see the "Annotation Guidelines.pdf" distributed with the corpus.
Data was collected from blogs and newspapers in Norwegian, as well as parliament speeches and governmental reports.
Initial Data Collection and Normalization
The texts in the Norwegian Dependency Treebank (NDT) are manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar.
The treebank consists of two parts, one part in Norwegian Bokmål (nob) and one part in Norwegian Nynorsk (nno). Both parts contain around 300,000 tokens, and are a mix of different non-fictional genres.
See the NDT webpage for more details.
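The following types of entities are annotated: persons (PER), organizations (ORG), locations (LOC), geo-political entities (GPE), products (PROD), events (EVT), and nominals derived from names (DRV), plus a miscellaneous class (MISC).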
Furthermore, all GPE entities are additionally sub-categorized as being either ORG or LOC, with the two annotation levels separated by an underscore: GPE_LOC and GPE_ORG.
The two special types GPE_LOC and GPE_ORG can easily be altered depending on the task, choosing either the more general GPE tag or the more specific LOC / ORG tags, conflating them with the other annotations of the same type. This means that, besides the full set of 9 types, two reduced sets can be derived: one with 8 types, in which both are mapped to GPE, and one with 7 types, in which they are mapped to LOC and ORG respectively.
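The conflation described above can be sketched as a small label-mapping function. This is an illustration only, not the code used to build the reduced configs: the function name `reduce_tag` and the scheme names `"gpe"` and `"loc_org"` are hypothetical, and the handling of `B-` / `I-` prefixes assumes BIO-style labels, which this card does not state.

```python
def reduce_tag(label: str, scheme: str = "full") -> str:
    """Map a label from the full tag set onto a reduced tag set.

    scheme="full"    keeps GPE_LOC and GPE_ORG as they are.
    scheme="gpe"     maps both to the more general GPE tag.
    scheme="loc_org" maps GPE_LOC to LOC and GPE_ORG to ORG.
    """
    if scheme == "full":
        return label
    if scheme not in ("gpe", "loc_org"):
        raise ValueError(f"unknown scheme: {scheme!r}")

    # Split off an optional BIO prefix such as "B-" or "I-" (assumed label format).
    prefix, sep, entity = label.rpartition("-")
    if entity not in ("GPE_LOC", "GPE_ORG"):
        return label

    general, specific = entity.split("_")
    reduced = general if scheme == "gpe" else specific
    return f"{prefix}{sep}{reduced}"


for label in ("B-GPE_LOC", "I-GPE_ORG", "B-PER", "O"):
    print(label, "->", reduce_tag(label, "gpe"), "/", reduce_tag(label, "loc_org"))
```

Labels other than GPE_LOC and GPE_ORG pass through unchanged, so the same function can be applied to every tag in a sentence.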
The class distribution is as follows, broken down across the data splits of the UD version of NDT, and sorted by total counts (i.e. the number of examples, not tokens within the spans of the annotations):
Type | Train | Dev | Test | Total |
---|---|---|---|---|
PER | 4033 | 607 | 560 | 5200 |
ORG | 2828 | 400 | 283 | 3511 |
GPE_LOC | 2132 | 258 | 257 | 2647 |
PROD | 671 | 162 | 71 | 904 |
LOC | 613 | 109 | 103 | 825 |
DRV | 519 | 77 | 48 | 644 |
GPE_ORG | 388 | 55 | 50 | 493 |
EVT | 131 | 9 | 5 | 145 |
MISC | 8 | 0 | 0 | 8 |
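As a worked check of the table, the sketch below recomputes each row total from the Train, Dev, and Test columns, and then derives the counts that result from conflating the GPE sub-types. The merged figures are plain sums of the rows above, not numbers reported by the corpus authors.

```python
# (train, dev, test) counts per entity type, copied from the table above.
counts = {
    "PER": (4033, 607, 560),
    "ORG": (2828, 400, 283),
    "GPE_LOC": (2132, 258, 257),
    "PROD": (671, 162, 71),
    "LOC": (613, 109, 103),
    "GPE_ORG": (388, 55, 50),
    "DRV": (519, 77, 48),
    "EVT": (131, 9, 5),
    "MISC": (8, 0, 0),
}

# Row totals, sorted from most to least frequent type.
totals = {entity: sum(per_split) for entity, per_split in counts.items()}
for entity, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{entity:<8} {total}")

# Conflating to the general GPE tag: GPE_LOC and GPE_ORG become one class.
gpe_total = totals["GPE_LOC"] + totals["GPE_ORG"]

# Conflating to the specific tags: GPE_LOC joins LOC, GPE_ORG joins ORG.
loc_total = totals["LOC"] + totals["GPE_LOC"]
org_total = totals["ORG"] + totals["GPE_ORG"]

print("GPE (merged):", gpe_total)
print("LOC (merged):", loc_total)
print("ORG (merged):", org_total)
```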
To access these reduced versions of the dataset, you can use the configs bokmaal-7, nynorsk-7, combined-7 for the NER tag set with 7 tags (ORG, LOC, PER, PROD, EVT, DRV, MISC), and bokmaal-8, nynorsk-8, combined-8 for the NER tag set with 8 tags (ORG, LOC, GPE, PER, PROD, EVT, DRV, MISC). By default, the full set (9 tags) will be used.
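The naming scheme for the configs can be captured in a small helper. The function `norne_config` is hypothetical and only builds the config string. The commented-out call shows how that string would typically be passed to `load_dataset` from the `datasets` library, which needs network access and is therefore not executed here.

```python
LANGUAGES = ("bokmaal", "nynorsk", "combined")


def norne_config(language: str, n_tags: int = 9) -> str:
    """Build a config name such as 'bokmaal', 'nynorsk-7' or 'combined-8'."""
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}, got {language!r}")
    if n_tags == 9:
        # The full tag set is the default and carries no suffix.
        return language
    if n_tags in (7, 8):
        return f"{language}-{n_tags}"
    raise ValueError(f"n_tags must be 7, 8 or 9, got {n_tags}")


print(norne_config("bokmaal"))
print(norne_config("nynorsk", 7))
print(norne_config("combined", 8))

# Typical usage (requires the `datasets` package and network access):
# from datasets import load_dataset
# dataset = load_dataset("NbAiLab/norne", norne_config("bokmaal", 7))
```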
NorNE was created as a collaboration between Schibsted Media Group, Språkbanken at the National Library of Norway and the Language Technology Group at the University of Oslo.
NorNE was added to Hugging Face Datasets by the AI-Lab at the National Library of Norway.
The NorNE corpus is published under the same license as the Norwegian Dependency Treebank.
This dataset is described in the paper NorNE: Annotating Named Entities for Norwegian by Fredrik Jørgensen, Tobias Aasmoe, Anne-Stine Ruud Husevåg, Lilja Øvrelid, and Erik Velldal, accepted for LREC 2020 and available as a pre-print here: https://arxiv.org/abs/1911.12146.