chcaa/DANSK

Dataset Summary

DANSK (Danish Annotations for NLP Specific TasKs) is a dataset of texts from multiple domains, sampled from the Danish Gigaword Corpus (DAGW). It was created to address the lack of Danish NLP datasets spanning different domains, which are required for training models that generalize across domains. The named-entity annotations are fine-grained and follow a scheme similar to that of OntoNotes v5, which significantly broadens the use cases of the dataset. The domains are Web, News, Wiki & Books, Legal, Dannet, Conversation, and Social Media. For a more in-depth description of the domains, please refer to DAGW.

The distribution of texts and named entities within each domain is given in the tables under Descriptive Statistics.

Update log

2023-05-26: Added individual annotations for each annotator to allow for analysis of inter-annotator agreement

Supported Tasks

The DANSK dataset currently supports only named-entity recognition; future releases are intended to include data for additional tasks.

Languages

All texts in the dataset are in Danish. Slang and dialect may appear, in keeping with the domains from which the texts were sampled (e.g. Social Media).

Dataset Structure

Data Instances

The JSON-formatted data is in the form seen below:

{
    "text": "Aborrer over 2 kg er en uhyre sj\u00e6lden fangst.",
    "ents": [{"start": 13, "end": 17, "label": "QUANTITY"}],
    "sents": [{"start": 0, "end": 45}],
    "tokens": [
        {"id": 0, "start": 0, "end": 7},
        {"id": 1, "start": 8, "end": 12},
        {"id": 2, "start": 13, "end": 14},
        {"id": 3, "start": 15, "end": 17},
        {"id": 4, "start": 18, "end": 20},
        {"id": 5, "start": 21, "end": 23},
        {"id": 6, "start": 24, "end": 29},
        {"id": 7, "start": 30, "end": 37},
        {"id": 8, "start": 38, "end": 44},
        {"id": 9, "start": 44, "end": 45}
    ],
    "spans": {"incorrect_spans": []},
    "dagw_source": "wiki",
    "dagw_domain": "Wiki & Books",
    "dagw_source_full": "Wikipedia"
}
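The `start` and `end` values are character offsets into `text`, with an exclusive end. A minimal sketch, using only the instance shown above (trimmed to the fields it needs), of how the offsets can be turned back into surface strings:

```python
import json

# The instance shown above, trimmed to the fields used in this sketch.
raw = r'''
{
    "text": "Aborrer over 2 kg er en uhyre sj\u00e6lden fangst.",
    "ents": [{"start": 13, "end": 17, "label": "QUANTITY"}],
    "sents": [{"start": 0, "end": 45}]
}
'''
record = json.loads(raw)
text = record["text"]

# Offsets are character positions with an exclusive end, so plain slicing works.
entities = [(text[e["start"]:e["end"]], e["label"]) for e in record["ents"]]
sentences = [text[s["start"]:s["end"]] for s in record["sents"]]

print(entities)  # [('2 kg', 'QUANTITY')]
print(sentences)
```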

Data Fields

text : The text
ents : The annotated entities
sents : The sentences of the text
dagw_source : Shorthand name of the source from which the text has been sampled in the Danish Gigaword Corpus
dagw_source_full : Full name of the source from which the text has been sampled in the Danish Gigaword Corpus
dagw_domain : Name of the domain to which the source belongs
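Many NER toolkits expect one tag per token rather than character spans. The sketch below combines the `ents` offsets with the `tokens` offsets from the instance under Data Instances to produce BIO tags. The BIO scheme and the helper name `bio_tags` are choices made for this illustration, not something the dataset prescribes, and the code assumes tokens are sorted by offset and entities do not overlap.

```python
def bio_tags(tokens, ents):
    """Assign a BIO tag to each token from character-offset entity spans."""
    tags = ["O"] * len(tokens)
    for ent in ents:
        inside = False
        for i, tok in enumerate(tokens):
            # A token belongs to the entity if it lies fully inside the span.
            if tok["start"] >= ent["start"] and tok["end"] <= ent["end"]:
                tags[i] = ("I-" if inside else "B-") + ent["label"]
                inside = True
    return tags


text = "Aborrer over 2 kg er en uhyre sjælden fangst."
tokens = [
    {"id": 0, "start": 0, "end": 7},
    {"id": 1, "start": 8, "end": 12},
    {"id": 2, "start": 13, "end": 14},
    {"id": 3, "start": 15, "end": 17},
    {"id": 4, "start": 18, "end": 20},
    {"id": 5, "start": 21, "end": 23},
    {"id": 6, "start": 24, "end": 29},
    {"id": 7, "start": 30, "end": 37},
    {"id": 8, "start": 38, "end": 44},
    {"id": 9, "start": 44, "end": 45},
]
ents = [{"start": 13, "end": 17, "label": "QUANTITY"}]

tags = bio_tags(tokens, ents)
for tok, tag in zip(tokens, tags):
    print(text[tok["start"]:tok["end"]], tag)
```

Here "2" receives `B-QUANTITY`, "kg" receives `I-QUANTITY`, and every other token receives `O`.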

Data Splits

The data was randomly split into three partitions: train, dev, and test. The splits are drawn from the same pool, so there are no systematic differences between them. For the distribution of named entities and domains across the partitions, please refer to the paper or to the summary statistics in the Dataset Composition section of this README.
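For illustration only, a random three-way split of this kind can be sketched as follows. The partition sizes are taken from the Dataset Composition table; the seed, the function name `random_split`, and the shuffle-then-slice procedure are assumptions and do not reproduce the authors' actual split.

```python
import random


def random_split(n_texts, n_dev, n_test, seed=0):
    """Shuffle indices once and cut them into train/dev/test partitions."""
    indices = list(range(n_texts))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]
    return train, dev, test


train, dev, test = random_split(15062, 1500, 1500)
print(len(train), len(dev), len(test))  # 12062 1500 1500
```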

Descriptive Statistics

Dataset Composition

Named entity annotation composition across partitions can be seen in the table below:

Category	Full	Train	Validation	Test
Texts	15062	12062 (80%)	1500 (10%)	1500 (10%)
Named entities	14462	11638 (80.47%)	1327 (9.18%)	1497 (10.25%)
CARDINAL	2069	1702 (82.26%)	168 (8.12%)	226 (10.92%)
DATE	1756	1411 (80.35%)	182 (10.36%)	163 (9.28%)
EVENT	211	175 (82.94%)	19 (9.00%)	17 (8.06%)
FACILITY	246	200 (81.30%)	25 (10.16%)	21 (8.54%)
GPE	1604	1276 (79.55%)	135 (8.42%)	193 (12.03%)
LANGUAGE	126	53 (42.06%)	17 (13.49%)	56 (44.44%)
LAW	183	148 (80.87%)	17 (9.29%)	18 (9.84%)
LOCATION	424	351 (82.78%)	46 (10.85%)	27 (6.37%)
MONEY	714	566 (79.27%)	72 (10.08%)	76 (10.64%)
NORP	495	405 (81.82%)	41 (8.28%)	49 (9.90%)
ORDINAL	127	105 (82.68%)	11 (8.66%)	11 (8.66%)
ORGANIZATION	2507	1960 (78.18%)	249 (9.93%)	298 (11.87%)
PERCENT	148	123 (83.11%)	13 (8.78%)	12 (8.11%)
PERSON	2133	1767 (82.84%)	191 (8.95%)	175 (8.20%)
PRODUCT	763	634 (83.09%)	57 (7.47%)	72 (9.44%)
QUANTITY	292	242 (82.88%)	28 (9.59%)	22 (7.53%)
TIME	218	185 (84.86%)	18 (8.26%)	15 (6.88%)
WORK OF ART	419	335 (79.95%)	38 (9.07%)	46 (10.98%)
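The percentages in parentheses can be read as each partition's count divided by the Full count for that row. A small sketch of that arithmetic, using the LANGUAGE row (the helper name `share` is hypothetical):

```python
def share(count, total):
    """Percentage of `total` represented by `count`, rounded to two decimals."""
    return round(100 * count / total, 2)


# LANGUAGE row from the table above.
language = {"full": 126, "train": 53, "validation": 17, "test": 56}

shares = {
    split: share(n, language["full"])
    for split, n in language.items()
    if split != "full"
}
print(shares)  # {'train': 42.06, 'validation': 13.49, 'test': 44.44}
```

The LANGUAGE row also shows that a random split over texts does not guarantee an 80/10/10 split for every entity type.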

Domain distribution

Domain and source distribution across partitions can be seen in the table below:

Domain	Source	Full	Train	Dev	Test
Conversation	Europa Parlamentet	206	173	17	16
Conversation	Folketinget	23	21	1	1
Conversation	NAAT	554	431	50	73
Conversation	OpenSubtitles	377	300	39	38
Conversation	Spontaneous speech	489	395	54	40
Dannet	Dannet	25	18	4	3
Legal	Retsinformation.dk	965	747	105	113
Legal	Skat.dk	471	364	53	54
Legal	Retspraksis	727	579	76	72
News	DanAvis	283	236	20	27
News	TV2R	138	110	16	12
Social Media	hestenettet.dk	554	439	51	64
Web	Common Crawl	8270	6661	826	783
Wiki & Books	adl	640	517	57	66
Wiki & Books	Wikipedia	279	208	30	41
Wiki & Books	WikiBooks	335	265	36	34
Wiki & Books	WikiSource	455	371	43	41
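To compare domains rather than individual sources, the rows can be summed per domain. A minimal sketch for the Train column, with the values copied from the table above:

```python
from collections import defaultdict

# (domain, source, train) rows copied from the table above.
rows = [
    ("Conversation", "Europa Parlamentet", 173),
    ("Conversation", "Folketinget", 21),
    ("Conversation", "NAAT", 431),
    ("Conversation", "OpenSubtitles", 300),
    ("Conversation", "Spontaneous speech", 395),
    ("Dannet", "Dannet", 18),
    ("Legal", "Retsinformation.dk", 747),
    ("Legal", "Skat.dk", 364),
    ("Legal", "Retspraksis", 579),
    ("News", "DanAvis", 236),
    ("News", "TV2R", 110),
    ("Social Media", "hestenettet.dk", 439),
    ("Web", "Common Crawl", 6661),
    ("Wiki & Books", "adl", 517),
    ("Wiki & Books", "Wikipedia", 208),
    ("Wiki & Books", "WikiBooks", 265),
    ("Wiki & Books", "WikiSource", 371),
]

train_by_domain = defaultdict(int)
for domain, _source, train_count in rows:
    train_by_domain[domain] += train_count

for domain, count in train_by_domain.items():
    print(f"{domain}: {count}")
```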

Entity Distribution across Domains

Domain and named entity distributions for the training set can be seen below:

Category	All domains combined	Conversation	Dannet	Legal	News	Social Media	Web	Wiki & Books
DOCS	12062	1320	18	1690	346	439	6661	1361
ENTS	11638	1060	15	1292	419	270	7502	883
CARDINAL	1702	346	6	95	35	17	1144	59
DATE	1411	113	5	257	40	29	831	126
EVENT	175	43	0	1	9	3	106	8
FACILITY	200	2	0	4	18	3	159	10
GPE	1276	130	2	60	68	31	846	128
LANGUAGE	53	3	0	0	0	0	34	16
LAW	148	10	0	100	1	0	22	13
LOCATION	351	18	0	1	7	7	288	29
MONEY	566	1	0	62	13	18	472	0
NORP	405	70	0	61	22	1	188	42
ORDINAL	105	11	0	17	9	2	43	22
ORGANIZATION	1960	87	0	400	61	39	1303	58
PERCENT	123	5	0	10	11	0	91	4
PERSON	1767	189	2	194	101	69	970	121
PRODUCT	634	3	0	10	2	33	581	3
QUANTITY	242	1	0	9	6	17	188	20
TIME	185	16	0	5	13	1	144	6
WORK OF ART	335	12	0	6	3	0	92	218
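One way to read the training-set table is as entity density, that is, annotated entities per text in each domain. A short sketch using the DOCS and ENTS rows above (the variable names are illustrative):

```python
# DOCS and ENTS rows of the training-set table above, per domain.
docs = {
    "Conversation": 1320, "Dannet": 18, "Legal": 1690, "News": 346,
    "Social Media": 439, "Web": 6661, "Wiki & Books": 1361,
}
ents = {
    "Conversation": 1060, "Dannet": 15, "Legal": 1292, "News": 419,
    "Social Media": 270, "Web": 7502, "Wiki & Books": 883,
}

# Average number of annotated entities per text in each domain.
density = {domain: round(ents[domain] / docs[domain], 2) for domain in docs}

for domain, value in sorted(density.items(), key=lambda item: -item[1]):
    print(f"{domain}: {value}")
```

With these figures, News and Web texts carry more than one entity per text on average, while the other domains carry fewer.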

Domain and named entity distributions for the validation set can be seen below:

Category	All domains combined	Conversation	Dannet	Legal	News	Social Media	Web	Wiki & Books
DOCS	1500	161	4	234	36	51	826	166
ENTS	1497	110	4	171	43	30	983	143
CARDINAL	226	41	2	19	7	5	139	13
DATE	163	11	0	27	6	4	89	26
EVENT	17	2	0	0	1	0	13	1
FACILITY	21	1	0	0	0	0	16	4
GPE	193	17	1	8	7	2	131	25
LANGUAGE	56	0	0	0	0	0	50	6
LAW	18	2	0	8	0	0	8	0
LOCATION	27	2	0	1	0	0	21	3
MONEY	76	2	0	9	1	6	58	0
NORP	49	8	0	8	1	2	21	9
ORDINAL	11	2	0	2	0	1	3	3
ORGANIZATION	298	6	0	68	5	3	212	4
PERCENT	12	0	0	2	0	0	10	0
PERSON	175	16	1	16	11	4	96	20
PRODUCT	72	0	0	0	0	2	69	1
QUANTITY	22	0	0	1	2	1	17	1
TIME	15	0	0	0	2	0	13	0
WORK OF ART	46	0	0	2	0	0	17	27

Domain and named entity distributions for the testing set can be seen below:

Category	All domains combined	Conversation	Dannet	Legal	News	Social Media	Web	Wiki & Books
DOCS	1500	161	4	234	36	51	826	166
ENTS	1497	110	4	171	43	30	983	143
CARDINAL	226	41	2	19	7	5	139	13
DATE	163	11	0	27	6	4	89	26
EVENT	17	2	0	0	1	0	13	1
FACILITY	21	1	0	0	0	0	16	4
GPE	193	17	1	8	7	2	131	25
LANGUAGE	56	0	0	0	0	0	50	6
LAW	18	2	0	8	0	0	8	0
LOCATION	27	2	0	1	0	0	21	3
MONEY	76	2	0	9	1	6	58	0
NORP	49	8	0	8	1	2	21	9
ORDINAL	11	2	0	2	0	1	3	3
ORGANIZATION	298	6	0	68	5	3	212	4
PERCENT	12	0	0	2	0	0	10	0
PERSON	175	16	1	16	11	4	96	20
PRODUCT	72	0	0	0	0	2	69	1
QUANTITY	22	0	0	1	2	1	17	1
TIME	15	0	0	0	2	0	13	0
WORK OF ART	46	0	0	2	0	0	17	27

Dataset Creation

Curation Rationale

The dataset is meant to fill a gap in Danish NLP, which until now has lacked a dataset with 1) fine-grained named entity recognition labels and 2) high variance in the domain origin of texts. DANSK is therefore intended for anyone who wishes to train NER models that are both generalizable across domains and fine-grained in their predictions. It may also be used for cross-domain evaluation, in order to uncover potential domain biases. While the dataset currently contains only named entity annotations, future versions are intended to feature dependency parsing, POS tagging, and possibly revised NER annotations.
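A cross-domain evaluation of the kind described above can be sketched by grouping exact-match span scores on `dagw_domain`. In the example below, the `pred_ents` field, the toy records, and the function names are hypothetical and exist only for illustration; a real evaluation would use model predictions on the test partition.

```python
from collections import defaultdict


def spans(record, key):
    """Represent entities as a set of (start, end, label) triples."""
    return {(e["start"], e["end"], e["label"]) for e in record[key]}


def per_domain_f1(records):
    """Exact-match span F1 per domain: F1 = 2TP / (2TP + FP + FN)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for rec in records:
        gold, pred = spans(rec, "ents"), spans(rec, "pred_ents")
        c = counts[rec["dagw_domain"]]
        c["tp"] += len(gold & pred)
        c["fp"] += len(pred - gold)
        c["fn"] += len(gold - pred)
    scores = {}
    for domain, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[domain] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Toy records; "pred_ents" is a hypothetical field holding model output.
records = [
    {
        "dagw_domain": "Web",
        "ents": [{"start": 0, "end": 4, "label": "PERSON"}],
        "pred_ents": [{"start": 0, "end": 4, "label": "PERSON"}],
    },
    {
        "dagw_domain": "Legal",
        "ents": [
            {"start": 0, "end": 3, "label": "LAW"},
            {"start": 10, "end": 14, "label": "DATE"},
        ],
        "pred_ents": [
            {"start": 0, "end": 3, "label": "LAW"},
            {"start": 10, "end": 14, "label": "TIME"},
        ],
    },
]

scores = per_domain_f1(records)
print(scores)  # {'Web': 1.0, 'Legal': 0.5}
```

A large gap between domains in such a table is one signal of domain bias in a model.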

Source Data

The data collection, annotation, and normalization steps of the data were extensive. As the description is too long for this readme, please refer to the associated paper upon its publication for a full description.

Initial Data Collection and Normalization

Annotations

Annotation process

To afford high granularity, the DANSK dataset utilized the annotation standard of OntoNotes 5.0. The standard features 18 different named entity types. The full description can be seen in the associated paper.

Who are the annotators?

Annotator Compensation

Ten students from the English Linguistics Master’s program at Aarhus University were employed as annotators. They worked 10 hours/week for six weeks, from October 11, 2021, to November 22, 2021. Their tasks included part-of-speech tagging, dependency parsing, and NER annotation. Annotators were compensated at the standard rate for students, 140 DKK/hour, as determined by the collective agreement of the Danish Ministry of Finance and the Central Organization of Teachers and the CO10 Central Organization of 2010 (the CO10 joint agreement). Named entity annotation and dependency parsing were done from scratch, while the POS tagging consisted of correcting the predictions of an NLP model.

Automatic correction

During the manual correction of the annotations, a series of consistent errors was found. These were corrected using the following regex patterns (see also the Danish Addendum to the OntoNotes annotation guidelines):

Regex Patterns

For matching with TIME spans, e.g. [16:30 - 17:30] (TIME):

\d{1,2}:\d\d ?[-|\||\/] ?\d
dag: \d{1,2}
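As a rough check of what these two patterns catch, they can be run with Python's `re` module. The sample strings are made up for this sketch, and reading the second pattern as "weekday name followed by an hour" (e.g. "Mandag: 10") is an assumption.

```python
import re

# The two TIME patterns listed above, copied verbatim.
TIME_RANGE = re.compile(r"\d{1,2}:\d\d ?[-|\||\/] ?\d")
WEEKDAY_HOUR = re.compile(r"dag: \d{1,2}")

samples = ["16:30 - 17:30", "9:00/10:00", "16:30", "Mandag: 10-16"]

results = {
    s: (bool(TIME_RANGE.search(s)), bool(WEEKDAY_HOUR.search(s)))
    for s in samples
}
for s, (is_range, is_weekday) in results.items():
    print(f"{s!r}: time range={is_range}, weekday hour={is_weekday}")
```

A single clock time such as "16:30" is not matched, because the first pattern requires a separator followed by a further digit.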

For matching with DATE spans, e.g. [1938 - 1992] (DATE):

\d{2,4} ?[-|–] ?\d{2,4}
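The DATE pattern can be exercised the same way. The sample strings are illustrative. Note that inside square brackets `|` is a literal character, so the class accepts a pipe as a separator in addition to the hyphen and the en dash.

```python
import re

# The DATE range pattern listed above, copied verbatim.
YEAR_RANGE = re.compile(r"\d{2,4} ?[-|–] ?\d{2,4}")

samples = ["1938 - 1992", "1938–1992", "1938|1992", "1992"]
matches = {s: bool(YEAR_RANGE.fullmatch(s)) for s in samples}

for s, ok in matches.items():
    print(f"{s!r}: {ok}")
```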

For matching companies with A/S and ApS, e.g. [Hansens Skomager A/S] (ORGANIZATION):

ApS
A\/S

For matching written numerals, e.g. "en":

to | to$|^to| To | To$|^To| TO | TO$|^TO|
tre | tre$|^tre| Tre | Tre$|^Tre| TRE | TRE$|^TRE|
fire | fire$|^fire| Fire | Fire$|^Fire| FIRE | FIRE$|^FIRE|
fem | fem$|^fem| Fem | Fem$|^Fem| FEM | FEM$|^FEM|
seks | seks$|^seks| Seks | Seks$|^Seks| SEKS | SEKS$|^SEKS|
syv | syv$|^syv| Syv | Syv$|^Syv| SYV | SYV$|^SYV|
otte | otte$|^otte| Otte | Otte$|^Otte| OTTE | OTTE$|^OTTE|
ni | ni$|^ni| Ni | Ni$|^Ni| NI | NI$|^NI|
ti | ti$|^ti| Ti | Ti$|^Ti| TI | TI$|^TI

For matching "Himlen" or "Himmelen" already annotated as LOCATION, e.g. "HIMLEN":

[Hh][iI][mM][lL][Ee][Nn]|[Hh][iI][mM][mM][Ee][lL][Ee][Nn]

For matching "Gud" already tagged as PERSON, e.g. "GUD":

[Gg][Uu][Dd]

For matching telephone numbers already wrongly tagged as CARDINAL, e.g. "20 40 44 30":

\d{2} \d{2} \d{2} \d{2}
\+\d{2} \d{2} ?\d{2} ?\d{2} ?\d{2}$
\+\d{2} \d{2} ?\d{2} ?\d{2} ?\d{2}$
 \d{4} ?\d{4}$
^\d{4} ?\d{4}$
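A sketch of how these patterns might be applied to the text of a span already tagged as CARDINAL. The helper `looks_like_phone` and the sample strings are hypothetical; what happens to a matching span afterwards is not specified here and is left out.

```python
import re

# The telephone patterns listed above (the repeated line is included once).
PHONE_PATTERNS = [
    r"\d{2} \d{2} \d{2} \d{2}",
    r"\+\d{2} \d{2} ?\d{2} ?\d{2} ?\d{2}$",
    r" \d{4} ?\d{4}$",
    r"^\d{4} ?\d{4}$",
]


def looks_like_phone(span_text):
    """Return True if any of the telephone patterns matches the span text."""
    return any(re.search(pattern, span_text) for pattern in PHONE_PATTERNS)


samples = ["20 40 44 30", "+45 20404430", "2040 4430", "tlf. 20404430", "15062"]
flags = {s: looks_like_phone(s) for s in samples}

for s, flag in flags.items():
    print(f"{s!r}: {flag}")
```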

For matching websites already wrongly tagged as ORGANIZATION:

.dk$|.com$
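In this pattern the dot is unescaped, so it matches any character, not only a literal ".". The sketch below contrasts the pattern as written with a stricter variant; the stricter variant and the sample strings are suggestions made for this illustration and are not part of the original correction procedure.

```python
import re

AS_WRITTEN = re.compile(r".dk$|.com$")     # pattern listed above
STRICTER = re.compile(r"\.(?:dk|com)$")    # hypothetical variant with a literal dot

samples = ["hestenettet.dk", "example.com", "Telecom"]
comparison = {
    s: (bool(AS_WRITTEN.search(s)), bool(STRICTER.search(s))) for s in samples
}

for s, (as_written, stricter) in comparison.items():
    print(f"{s!r}: as written={as_written}, stricter={stricter}")
```

"Telecom" matches the pattern as written because `.com$` accepts "ecom", whereas the stricter variant requires a literal dot before the suffix.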

For matching Hotels and Resorts already wrongly tagged as ORGANIZATION:

.*[h|H]otel.*|.*[R|r]esort.*

For matching numbers including / or :, already wrongly tagged as CARDINAL:

\/
\/
 
-

For matching rights already wrongly tagged as LAW:

[C|c]opyright
[®|©]
[f|F]ortrydelsesret
[o|O]phavsret$
enneskeret

Licensing Information

Creative Commons Attribution-Share Alike 4.0 International license

Citation Information

The paper is in progress.

Author:

chcaa

Dataset size:

33 MB