Dataset: chcaa/DANSK
Language: da
DANSK: Danish Annotations for NLP Specific TasKs is a dataset consisting of texts from multiple domains, sampled from the Danish GigaWord Corpus (DAGW). The dataset was created to fill the gap in Danish NLP datasets spanning different domains, which are required for training models that generalize across domains. The named-entity annotations are also fine-grained and follow a form similar to that of OntoNotes v5, which significantly broadens the dataset's use cases. The domains include Web, News, Wiki & Books, Legal, Dannet, Conversation and Social Media. For a more in-depth understanding of the domains, please refer to DAGW.
The distribution of texts and Named Entities within each domain can be seen in the table below:
The DANSK dataset currently only supports Named-Entity Recognition, but additional version releases will contain data for more tasks.
All texts in the dataset are in Danish. Slang from various platforms or dialects may appear, consistent with the domains from which the texts were originally sampled, e.g. Social Media.
The JSON-formatted data is in the form seen below:
    {
      "text": "Aborrer over 2 kg er en uhyre sj\u00e6lden fangst.",
      "ents": [{"start": 13, "end": 17, "label": "QUANTITY"}],
      "sents": [{"start": 0, "end": 45}],
      "tokens": [
        {"id": 0, "start": 0, "end": 7},
        {"id": 1, "start": 8, "end": 12},
        {"id": 2, "start": 13, "end": 14},
        {"id": 3, "start": 15, "end": 17},
        {"id": 4, "start": 18, "end": 20},
        {"id": 5, "start": 21, "end": 23},
        {"id": 6, "start": 24, "end": 29},
        {"id": 7, "start": 30, "end": 37},
        {"id": 8, "start": 38, "end": 44},
        {"id": 9, "start": 44, "end": 45}
      ],
      "spans": {"incorrect_spans": []},
      "dagw_source": "wiki",
      "dagw_domain": "Wiki & Books",
      "dagw_source_full": "Wikipedia"
    }
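As a minimal sketch of how the offsets in such a record can be read, the snippet below slices the text with each entity's `start` and `end` values. It assumes these are character offsets with an exclusive end, which is consistent with the example record; the helper name `entity_texts` is hypothetical and the token list is omitted for brevity.

```python
import json

# One record in the format shown above (token list omitted for brevity).
raw = r'''
{
  "text": "Aborrer over 2 kg er en uhyre sj\u00e6lden fangst.",
  "ents": [{"start": 13, "end": 17, "label": "QUANTITY"}],
  "sents": [{"start": 0, "end": 45}],
  "dagw_domain": "Wiki & Books"
}
'''


def entity_texts(record):
    """Return (surface text, label) pairs by slicing the text with each span's offsets.

    Assumption: "start"/"end" are character offsets and "end" is exclusive.
    """
    text = record["text"]
    return [(text[e["start"]:e["end"]], e["label"]) for e in record["ents"]]


record = json.loads(raw)
entities = entity_texts(record)
print(entities)
```

The same slicing applies to the `sents` and `tokens` lists, since they use the same `start`/`end` fields.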
The data was randomly split into three distinct partitions: train, dev, and test. The splits are drawn from the same pool, so there are no systematic differences between the sets. For the distribution of named entities and domains across the partitions, please refer to the paper or to the summary statistics in the Dataset composition section of this README.
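For illustration only, a random split of this kind could be sketched as follows. This is not the authors' procedure: the function name `random_split`, the seed, and the use of integer stand-ins for texts are all assumptions; the partition sizes are taken from the composition table in this README.

```python
import random


def random_split(items, n_dev, n_test, seed=0):
    """Shuffle a copy of items and cut it into train/dev/test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumed value)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


doc_ids = range(15062)  # stand-in for the 15062 texts
train, dev, test = random_split(doc_ids, n_dev=1500, n_test=1500)
print(len(train), len(dev), len(test))
```

Because every text is drawn from the same shuffled pool, each partition is expected to have a similar mix of domains and entity types, apart from sampling noise.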
Named entity annotation composition across partitions can be seen in the table below:
| | Full | Train | Validation | Test |
---|---|---|---|---|
Texts | 15062 | 12062 (80%) | 1500 (10%) | 1500 (10%) |
Named entities | 14462 | 11638 (80.47%) | 1327 (9.18%) | 1497 (10.35%) |
CARDINAL | 2096 | 1702 (81.20%) | 168 (8.02%) | 226 (10.78%) |
DATE | 1756 | 1411 (80.35%) | 182 (10.36%) | 163 (9.28%) |
EVENT | 211 | 175 (82.94%) | 19 (9.00%) | 17 (8.06%) |
FACILITY | 246 | 200 (81.30%) | 25 (10.16%) | 21 (8.54%) |
GPE | 1604 | 1276 (79.55%) | 135 (8.42%) | 193 (12.03%) |
LANGUAGE | 126 | 53 (42.06%) | 17 (13.49%) | 56 (44.44%) |
LAW | 183 | 148 (80.87%) | 17 (9.29%) | 18 (9.84%) |
LOCATION | 424 | 351 (82.78%) | 46 (10.85%) | 27 (6.37%) |
MONEY | 714 | 566 (79.27%) | 72 (10.08%) | 76 (10.64%) |
NORP | 495 | 405 (81.82%) | 41 (8.28%) | 49 (9.90%) |
ORDINAL | 127 | 105 (82.68%) | 11 (8.66%) | 11 (8.66%) |
ORGANIZATION | 2507 | 1960 (78.18%) | 249 (9.93%) | 298 (11.89%) |
PERCENT | 148 | 123 (83.11%) | 13 (8.78%) | 12 (8.11%) |
PERSON | 2133 | 1767 (82.84%) | 191 (8.95%) | 175 (8.20%) |
PRODUCT | 763 | 634 (83.09%) | 57 (7.47%) | 72 (9.44%) |
QUANTITY | 292 | 242 (82.88%) | 28 (9.59%) | 22 (7.53%) |
TIME | 218 | 185 (84.86%) | 18 (8.26%) | 15 (6.88%) |
WORK OF ART | 419 | 335 (79.95%) | 38 (9.07%) | 46 (10.98%) |
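The percentages in parentheses appear to be each partition's share of the Full count. As a small sketch under that assumption, the DATE row can be reproduced like this (the function name `partition_shares` is hypothetical):

```python
def partition_shares(train, validation, test):
    """Return each partition's share of the combined count as a percentage string."""
    total = train + validation + test
    return tuple(f"{100 * n / total:.2f}%" for n in (train, validation, test))


# DATE row from the table above: 1411 train, 182 validation, 163 test.
date_shares = partition_shares(1411, 182, 163)
print(date_shares)
```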
Domain and source distribution across partitions can be seen in the table below:
Domain | Source | Full | Train | Dev | Test |
---|---|---|---|---|---|
Conversation | Europa Parlamentet | 206 | 173 | 17 | 16 |
Conversation | Folketinget | 23 | 21 | 1 | 1 |
Conversation | NAAT | 554 | 431 | 50 | 73 |
Conversation | OpenSubtitles | 377 | 300 | 39 | 38 |
Conversation | Spontaneous speech | 489 | 395 | 54 | 40 |
Dannet | Dannet | 25 | 18 | 4 | 3 |
Legal | Retsinformation.dk | 965 | 747 | 105 | 113 |
Legal | Skat.dk | 471 | 364 | 53 | 54 |
Legal | Retspraksis | 727 | 579 | 76 | 72 |
News | DanAvis | 283 | 236 | 20 | 27 |
News | TV2R | 138 | 110 | 16 | 12 |
Social Media | hestenettet.dk | 554 | 439 | 51 | 64 |
Web | Common Crawl | 8270 | 6661 | 826 | 783 |
Wiki & Books | adl | 640 | 517 | 57 | 66 |
Wiki & Books | Wikipedia | 279 | 208 | 30 | 41 |
Wiki & Books | WikiBooks | 335 | 265 | 36 | 34 |
Wiki & Books | WikiSource | 455 | 371 | 43 | 41 |
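Per-domain document counts follow from summing the sources within each domain. A minimal sketch, using the train counts of two domains copied from the table above (the function name `totals_by_domain` is hypothetical):

```python
from collections import defaultdict

# (domain, source, train count) rows copied from the table above; two domains shown.
rows = [
    ("Conversation", "Europa Parlamentet", 173),
    ("Conversation", "Folketinget", 21),
    ("Conversation", "NAAT", 431),
    ("Conversation", "OpenSubtitles", 300),
    ("Conversation", "Spontaneous speech", 395),
    ("News", "DanAvis", 236),
    ("News", "TV2R", 110),
]


def totals_by_domain(rows):
    """Sum the per-source counts within each domain."""
    totals = defaultdict(int)
    for domain, _source, count in rows:
        totals[domain] += count
    return dict(totals)


domain_totals = totals_by_domain(rows)
print(domain_totals)
```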
Domain and named entity distributions for the training set can be seen below:
| | All domains combined | Conversation | Dannet | Legal | News | Social Media | Web | Wiki & Books |
---|---|---|---|---|---|---|---|---|
DOCS | 12062 | 1320 | 18 | 1690 | 346 | 439 | 6661 | 1361 |
ENTS | 11638 | 1060 | 15 | 1292 | 419 | 270 | 7502 | 883 |
CARDINAL | 1702 | 346 | 6 | 95 | 35 | 17 | 1144 | 59 |
DATE | 1411 | 113 | 5 | 257 | 40 | 29 | 831 | 126 |
EVENT | 175 | 43 | 0 | 1 | 9 | 3 | 106 | 8 |
FACILITY | 200 | 2 | 0 | 4 | 18 | 3 | 159 | 10 |
GPE | 1276 | 130 | 2 | 60 | 68 | 31 | 846 | 128 |
LANGUAGE | 53 | 3 | 0 | 0 | 0 | 0 | 34 | 16 |
LAW | 148 | 10 | 0 | 100 | 1 | 0 | 22 | 13 |
LOCATION | 351 | 18 | 0 | 1 | 7 | 7 | 288 | 29 |
MONEY | 566 | 1 | 0 | 62 | 13 | 18 | 472 | 0 |
NORP | 405 | 70 | 0 | 61 | 22 | 1 | 188 | 42 |
ORDINAL | 105 | 11 | 0 | 17 | 9 | 2 | 43 | 22 |
ORGANIZATION | 1960 | 87 | 0 | 400 | 61 | 39 | 1303 | 58 |
PERCENT | 123 | 5 | 0 | 10 | 11 | 0 | 91 | 4 |
PERSON | 1767 | 189 | 2 | 194 | 101 | 69 | 970 | 121 |
PRODUCT | 634 | 3 | 0 | 10 | 2 | 33 | 581 | 3 |
QUANTITY | 242 | 1 | 0 | 9 | 6 | 17 | 188 | 20 |
TIME | 185 | 16 | 0 | 5 | 13 | 1 | 144 | 6 |
WORK OF ART | 335 | 12 | 0 | 6 | 3 | 0 | 92 | 218 |
Domain and named entity distributions for the validation set can be seen below:
| | Sum | Conversation | Dannet | Legal | News | Social Media | Web | Wiki & Books |
---|---|---|---|---|---|---|---|---|
DOCS | 1500 | 161 | 4 | 234 | 36 | 51 | 826 | 166 |
ENTS | 1497 | 110 | 4 | 171 | 43 | 30 | 983 | 143 |
CARDINAL | 226 | 41 | 2 | 19 | 7 | 5 | 139 | 13 |
DATE | 163 | 11 | 0 | 27 | 6 | 4 | 89 | 26 |
EVENT | 17 | 2 | 0 | 0 | 1 | 0 | 13 | 1 |
FACILITY | 21 | 1 | 0 | 0 | 0 | 0 | 16 | 4 |
GPE | 193 | 17 | 1 | 8 | 7 | 2 | 131 | 25 |
LANGUAGE | 56 | 0 | 0 | 0 | 0 | 0 | 50 | 6 |
LAW | 18 | 2 | 0 | 8 | 0 | 0 | 8 | 0 |
LOCATION | 27 | 2 | 0 | 1 | 0 | 0 | 21 | 3 |
MONEY | 76 | 2 | 0 | 9 | 1 | 6 | 58 | 0 |
NORP | 49 | 8 | 0 | 8 | 1 | 2 | 21 | 9 |
ORDINAL | 11 | 2 | 0 | 2 | 0 | 1 | 3 | 3 |
ORGANIZATION | 298 | 6 | 0 | 68 | 5 | 3 | 212 | 4 |
PERCENT | 12 | 0 | 0 | 2 | 0 | 0 | 10 | 0 |
PERSON | 175 | 16 | 1 | 16 | 11 | 4 | 96 | 20 |
PRODUCT | 72 | 0 | 0 | 0 | 0 | 2 | 69 | 1 |
QUANTITY | 22 | 0 | 0 | 1 | 2 | 1 | 17 | 1 |
TIME | 15 | 0 | 0 | 0 | 2 | 0 | 13 | 0 |
WORK OF ART | 46 | 0 | 0 | 2 | 0 | 0 | 17 | 27 |
Domain and named entity distributions for the testing set can be seen below:
| | Sum | Conversation | Dannet | Legal | News | Social Media | Web | Wiki & Books |
---|---|---|---|---|---|---|---|---|
DOCS | 1500 | 161 | 4 | 234 | 36 | 51 | 826 | 166 |
ENTS | 1497 | 110 | 4 | 171 | 43 | 30 | 983 | 143 |
CARDINAL | 226 | 41 | 2 | 19 | 7 | 5 | 139 | 13 |
DATE | 163 | 11 | 0 | 27 | 6 | 4 | 89 | 26 |
EVENT | 17 | 2 | 0 | 0 | 1 | 0 | 13 | 1 |
FACILITY | 21 | 1 | 0 | 0 | 0 | 0 | 16 | 4 |
GPE | 193 | 17 | 1 | 8 | 7 | 2 | 131 | 25 |
LANGUAGE | 56 | 0 | 0 | 0 | 0 | 0 | 50 | 6 |
LAW | 18 | 2 | 0 | 8 | 0 | 0 | 8 | 0 |
LOCATION | 27 | 2 | 0 | 1 | 0 | 0 | 21 | 3 |
MONEY | 76 | 2 | 0 | 9 | 1 | 6 | 58 | 0 |
NORP | 49 | 8 | 0 | 8 | 1 | 2 | 21 | 9 |
ORDINAL | 11 | 2 | 0 | 2 | 0 | 1 | 3 | 3 |
ORGANIZATION | 298 | 6 | 0 | 68 | 5 | 3 | 212 | 4 |
PERCENT | 12 | 0 | 0 | 2 | 0 | 0 | 10 | 0 |
PERSON | 175 | 16 | 1 | 16 | 11 | 4 | 96 | 20 |
PRODUCT | 72 | 0 | 0 | 0 | 0 | 2 | 69 | 1 |
QUANTITY | 22 | 0 | 0 | 1 | 2 | 1 | 17 | 1 |
TIME | 15 | 0 | 0 | 0 | 2 | 0 | 13 | 0 |
WORK OF ART | 46 | 0 | 0 | 2 | 0 | 0 | 17 | 27 |
The dataset is meant to fill a gap in Danish NLP, which until now has lacked a dataset with 1) fine-grained named entity recognition labels and 2) high variance in the domain origin of its texts. DANSK is therefore intended for anyone who wishes to train NER models that both generalize across domains and make fine-grained predictions. It may also be used for cross-domain evaluation, in order to reveal potential domain biases. While the dataset currently only contains annotations for named entities, future versions are intended to feature dependency parsing, POS tagging, and possibly revised NER annotations.
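A cross-domain evaluation of this kind could be sketched as below: span-level F1 computed separately for each `dagw_domain`. This is an illustrative assumption rather than an evaluation script from the authors; the `gold` and `pred` keys, the exact-match criterion on (start, end, label), and the function name `per_domain_f1` are all hypothetical.

```python
from collections import defaultdict


def per_domain_f1(docs):
    """Compute exact-match span F1 per domain.

    docs: iterable of dicts with 'dagw_domain', 'gold' and 'pred' keys, where
    'gold' and 'pred' are sets of (start, end, label) tuples (assumed layout).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for doc in docs:
        c = counts[doc["dagw_domain"]]
        c["tp"] += len(doc["gold"] & doc["pred"])  # spans predicted correctly
        c["fp"] += len(doc["pred"] - doc["gold"])  # predicted but not in gold
        c["fn"] += len(doc["gold"] - doc["pred"])  # in gold but not predicted
    scores = {}
    for domain, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        scores[domain] = 2 * p * r / (p + r) if p + r else 0.0
    return scores


# Tiny made-up example, not taken from the dataset.
docs = [
    {"dagw_domain": "News", "gold": {(0, 4, "PERSON")}, "pred": {(0, 4, "PERSON")}},
    {"dagw_domain": "Legal", "gold": {(3, 9, "LAW"), (12, 16, "DATE")}, "pred": {(3, 9, "LAW")}},
]
scores = per_domain_f1(docs)
print(scores)
```

Large gaps between the per-domain scores would point to the kind of domain bias described above.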
The data collection, annotation, and normalization steps of the data were extensive. As the description is too long for this readme, please refer to the associated paper upon its publication for a full description.
Initial Data Collection and Normalization
To afford high granularity, the DANSK dataset utilized the annotation standard of OntoNotes 5.0. The standard features 18 different named entity types. The full description can be seen in the associated paper.
Who are the annotators?
10 English Linguistics Master’s program students from Aarhus University were employed. They worked 10 hours/week for six weeks from October 11, 2021, to November 22, 2021. Their annotation tasks included part-of-speech tagging, dependency parsing, and NER annotation. Annotators were compensated at the standard rate for students, as determined by the collective agreement of the Danish Ministry of Finance and the Central Organization of Teachers and the CO10 Central Organization of 2010 (the CO10 joint agreement), which is 140 DKK/hour. Named entity annotations and dependency parsing were done from scratch, while the POS tagging consisted of corrections of silver-standard predictions by an NLP model.
During the manual correction of the annotations, a series of consistent errors was found. These were corrected using the following regex patterns (see also the Danish Addendum to the OntoNotes annotation guidelines):
Regex Patterns
For matching with TIME spans, e.g. [16:30 - 17:30] (TIME):
\d{1,2}:\d\d ?[-|\||\/] ?\d
dag: \d{1,2}
For matching with DATE spans, e.g. [1938 - 1992] (DATE):
\d{2,4} ?[-|–] ?\d{2,4}
For matching companies with A/S and ApS, e.g. [Hansens Skomager A/S] (ORGANIZATION):
ApS
A\/S
For matching written numerals, e.g. "en":
to | to$|^to| To | To$|^To| TO | TO$|^TO| tre | tre$|^tre| Tre | Tre$|^Tre| TRE | TRE$|^TRE| fire | fire$|^fire| Fire | Fire$|^Fire| FIRE | FIRE$|^FIRE| fem | fem$|^fem| Fem | Fem$|^Fem| FEM | FEM$|^FEM| seks | seks$|^seks| Seks | Seks$|^Seks| SEKS | SEKS$| ^SYV| otte | otte$|^otte| Otte | Otte$|^Otte| OTTE | OTTE$|^OTTE| ni | ni$|^ni| Ni | Ni$|^Ni| NI | NI$|^NI| ti | ti$|^ti| Ti | Ti$|^Ti| TI | TI$|^TI
For matching "Himlen" or "Himmelen" already annotated as LOCATION, e.g. "HIMLEN":
[Hh][iI][mM][lL][Ee][Nn]|[Hh][iI][mM][mM][Ee][lL][Ee][Nn]
For matching "Gud" already tagged as PERSON, e.g. "GUD":
[Gg][Uu][Dd]
For matching telephone numbers already wrongly tagged as CARDINAL, e.g. "20 40 44 30":
\d{2} \d{2} \d{2} \d{2}
\+\d{2} \d{2} ?\d{2} ?\d{2} ?\d{2}$
\d{4} ?\d{4}$
^\d{4} ?\d{4}$
For matching websites already wrongly tagged as ORGANIZATION:
.dk$|.com$
For matching Hotels and Resorts already wrongly tagged as ORGANIZATION:
.*[h|H]otel.*|.*[R|r]esort.*
For matching numbers including / or :, already wrongly tagged as CARDINAL:
\/ \/ -
For matching rights already wrongly tagged as LAW:
[C|c]opyright [®|©] [f|F]ortrydelsesret [o|O]phavsret$ enneskeret
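As a hedged sketch of how patterns like those above can be used to flag annotated spans for review, the snippet below compiles three of them with Python's `re` module. The dictionary keys, the helper name `matching_patterns`, and the choice of `fullmatch` (matching the whole span text) are assumptions; the source does not state how the patterns were applied.

```python
import re

# Three patterns copied from the list above; the dictionary keys are hypothetical names.
PATTERNS = {
    "date_range": re.compile(r"\d{2,4} ?[-|–] ?\d{2,4}"),
    "gud": re.compile(r"[Gg][Uu][Dd]"),
    "himlen": re.compile(r"[Hh][iI][mM][lL][Ee][Nn]|[Hh][iI][mM][mM][Ee][lL][Ee][Nn]"),
}


def matching_patterns(span_text):
    """Return the names of all patterns that match the whole span text."""
    return [name for name, rx in PATTERNS.items() if rx.fullmatch(span_text)]


for span in ["1938 - 1992", "GUD", "Himmelen", "Aarhus"]:
    print(span, matching_patterns(span))
```

A span that matches a pattern is only a candidate for correction; whether and how it is relabeled depends on the annotation guidelines.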
Creative Commons Attribution-Share Alike 4.0 International license
The paper is in progress.