Dataset:
KBLab/sucx3_ner
The dataset is a conversion of the venerable SUC 3.0 dataset into the Hugging Face ecosystem. The original dataset has no official train-dev-test split, so one is introduced here; the distribution of NER tags is roughly the same across the three splits.
The dataset has three tagsets: manually annotated POS, manually annotated NER, and automatically annotated NER. For the automatically annotated NER tags, only sentences in which the automatic and manual annotations match (with their respective categories) were chosen.
Additionally, we provide remixes of the same data in which some or all sentences are lowercased.
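A lowercased remix of this kind could be produced roughly as follows. This is a minimal sketch rather than the script actually used to build the dataset: the helper names `lowercase_instance` and `remix`, the `fraction` and `seed` parameters, and the example record are all hypothetical, and only the `id` and `tokens` fields and the `_lower` suffix come from the dataset's instance format.

```python
import random


def lowercase_instance(instance: dict) -> dict:
    """Hypothetical helper: lowercase the tokens and mark the id with `_lower`."""
    return {
        **instance,
        "id": instance["id"] + "_lower",
        "tokens": [token.lower() for token in instance["tokens"]],
    }


def remix(instances: list[dict], fraction: float, seed: int = 0) -> list[dict]:
    """Lowercase roughly `fraction` of the sentences (1.0 lowercases all of them)."""
    rng = random.Random(seed)  # fixed seed so the remix is reproducible
    return [
        lowercase_instance(inst) if rng.random() < fraction else inst
        for inst in instances
    ]


# Illustrative record (made up for this sketch, not taken from the dataset).
example = {
    "id": "demo-0001",
    "tokens": ["Vi", "kräver", "ersättning"],
    "pos_tags": ["PN", "VB", "NN"],
    "ner_tags": ["O", "O", "O"],
}

fully_lowercased = remix([example], fraction=1.0)
print(fully_lowercased[0]["id"], fully_lowercased[0]["tokens"])
```

The tag lists are left untouched, since lowercasing changes only the surface form of the tokens.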
Swedish
For each instance, there is an id (with an optional _lower suffix marking that the sentence has been modified), a tokens list of token strings, a pos_tags list of POS tags, and a ner_tags list of NER tags.
{"id": "e24d782c-e2475603_lower", "tokens": ["-", "dels", "har", "vi", "inget", "index", "att", "g\u00e5", "efter", ",", "vi", "kr\u00e4ver", "allts\u00e5", "ers\u00e4ttning", "i", "40-talets", "penningv\u00e4rde", "."], "pos_tags": ["MID", "KN", "VB", "PN", "DT", "NN", "IE", "VB", "PP", "MID", "PN", "VB", "AB", "NN", "PP", "NN", "NN", "MAD"], "ner_tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
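The instance layout above can be checked with a few lines of Python. The sketch below is illustrative only: `check_instance` and `is_lowercased` are hypothetical helper names, and the record is the example above shortened to its first three tokens for readability.

```python
import json

# The example record, shortened to its first three tokens.
raw = (
    '{"id": "e24d782c-e2475603_lower", '
    '"tokens": ["-", "dels", "har"], '
    '"pos_tags": ["MID", "KN", "VB"], '
    '"ner_tags": ["O", "O", "O"]}'
)


def check_instance(instance: dict) -> bool:
    """Return True if all four documented fields exist and the lists align."""
    required = ("id", "tokens", "pos_tags", "ner_tags")
    if any(key not in instance for key in required):
        return False
    n = len(instance["tokens"])
    return len(instance["pos_tags"]) == n and len(instance["ner_tags"]) == n


def is_lowercased(instance: dict) -> bool:
    """The `_lower` id suffix marks a sentence that has been modified."""
    return instance["id"].endswith("_lower")


instance = json.loads(raw)
print(check_instance(instance), is_lowercased(instance))
```

Checking that the three lists have equal length is a cheap sanity test before feeding the data to a token classification model, where each token needs exactly one tag.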
Dataset Split | Percentage of Total Dataset Size | Number of Instances for the Original Tags |
---|---|---|
train | 64% | 46,026 |
dev | 16% | 11,506 |
test | 20% | 14,383 |
The simple_tags remix has fewer instances due to the requirement to match tags.
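As a quick consistency check, the percentages in the table can be recomputed from the instance counts for the original tags. The variable names are illustrative.

```python
# Instance counts for the original tags, as listed in the table.
counts = {"train": 46_026, "dev": 11_506, "test": 14_383}

total = sum(counts.values())
# Share of each split, rounded to whole percent.
shares = {split: round(100 * n / total) for split, n in counts.items()}

print(total, shares)
# 71915 {'train': 64, 'dev': 16, 'test': 20}
```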
See the original webpage
Språkbanken
CC BY 4.0 (attribution)
Thanks to @robinqrtz for adding this dataset.