Dataset:

KBLab/sucx3_ner

Tasks:

task_categories:other

Subtasks:

named-entity-recognition part-of-speech

Languages:

Multilinguality:

monolingual

Size:

size_categories:unknown

Language creators:

other

Annotation creators:

expert-generated

Source datasets:

original

Other:

structure-prediction

License:

cc-by-4.0


Dataset Card for SUCX 3.0 - NER

Dataset Summary

This dataset is a conversion of the venerable SUC 3.0 dataset into the Hugging Face ecosystem. The original dataset has no official train-dev-test split, so one is introduced here; the distribution of NER tags is roughly the same across the three splits.

The dataset has three types of tagsets: manually annotated POS, manually annotated NER, and automatically annotated NER. For the automatically annotated NER tags, only sentences in which the automatic and manual annotations match (within their respective categories) were kept.

Additionally, we provide remixes of the same data in which some or all sentences are lowercased.

Supported Tasks and Leaderboards

Part-of-Speech tagging
Named-Entity-Recognition

Languages

Swedish

Dataset Structure

Data Remixes

original_tags: contains the manual NER annotations
- lower: the whole dataset uncased
- lower_mix: some of the dataset uncased
- lower_both: every instance both cased and uncased
simple_tags: contains the automatic NER annotations
- lower: the whole dataset uncased
- lower_mix: some of the dataset uncased
- lower_both: every instance both cased and uncased
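
The three lowercasing remixes can be illustrated with a minimal sketch. This is not the curators' conversion script: the function names, the `fraction` and `seed` parameters, and the sample sentence are assumptions made for illustration, and the share of lowercased sentences in lower_mix is not stated in this card.

```python
import random
from typing import Dict, List

Instance = Dict[str, object]


def lowercase_instance(instance: Instance) -> Instance:
    """Return an uncased copy of an instance, with the _lower suffix on its id."""
    return {
        **instance,
        "id": f"{instance['id']}_lower",
        "tokens": [token.lower() for token in instance["tokens"]],
    }


def make_remix(instances: List[Instance], mode: str,
               fraction: float = 0.5, seed: int = 0) -> List[Instance]:
    """Build a hypothetical remix of cased instances.

    mode is one of "lower", "lower_mix", "lower_both".
    fraction (assumed, not from the card) is the chance that lower_mix
    lowercases a given sentence.
    """
    rng = random.Random(seed)
    if mode == "lower":
        return [lowercase_instance(inst) for inst in instances]
    if mode == "lower_mix":
        return [lowercase_instance(inst) if rng.random() < fraction else inst
                for inst in instances]
    if mode == "lower_both":
        remix: List[Instance] = []
        for inst in instances:
            remix.extend([inst, lowercase_instance(inst)])
        return remix
    raise ValueError(f"unknown remix mode: {mode}")


# Made-up sample sentence; the id is hypothetical.
sample = [{
    "id": "example-0001",
    "tokens": ["Vi", "kräver", "ersättning", "."],
    "pos_tags": ["PN", "VB", "NN", "MAD"],
    "ner_tags": ["O", "O", "O", "O"],
}]

both = make_remix(sample, "lower_both")
print([inst["id"] for inst in both])
```

Only the tokens are lowercased; the POS and NER tags are carried over unchanged, so they remain aligned with the tokens.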

Data Instances

Each instance has an id (with an optional _lower suffix marking that the instance has been modified), a tokens list of strings containing the tokens, a pos_tags list of strings containing the POS tags, and a ner_tags list of strings containing the NER tags.

{"id": "e24d782c-e2475603_lower",
"tokens": ["-", "dels", "har", "vi", "inget", "index", "att", "g\u00e5", "efter", ",", "vi", "kr\u00e4ver", "allts\u00e5", "ers\u00e4ttning", "i", "40-talets", "penningv\u00e4rde", "."],
"pos_tags": ["MID", "KN", "VB", "PN", "DT", "NN", "IE", "VB", "PP", "MID", "PN", "VB", "AB", "NN", "PP", "NN", "NN", "MAD"],
"ner_tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}

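As a small illustration of how such an instance might be read, the sketch below parses the example above with the standard library and lines up each token with its POS and NER tag. The variable names are chosen for illustration only.

```python
import json

# The example instance from this card, kept as raw JSON text.
raw = r'''
{"id": "e24d782c-e2475603_lower",
 "tokens": ["-", "dels", "har", "vi", "inget", "index", "att", "g\u00e5", "efter", ",", "vi", "kr\u00e4ver", "allts\u00e5", "ers\u00e4ttning", "i", "40-talets", "penningv\u00e4rde", "."],
 "pos_tags": ["MID", "KN", "VB", "PN", "DT", "NN", "IE", "VB", "PP", "MID", "PN", "VB", "AB", "NN", "PP", "NN", "NN", "MAD"],
 "ner_tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
'''

instance = json.loads(raw)

# The _lower suffix on the id marks a modified instance.
is_modified = instance["id"].endswith("_lower")

# One (token, POS tag, NER tag) row per token.
rows = list(zip(instance["tokens"], instance["pos_tags"], instance["ner_tags"]))

for token, pos, ner in rows:
    print(f"{token}\t{pos}\t{ner}")
```
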
Data Fields

id: a string containing the sentence ID
tokens: a list of strings containing the sentence's tokens
pos_tags: a list of strings containing the tokens' POS annotations
ner_tags: a list of strings containing the tokens' NER annotations
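
The field descriptions can be written down as a typed record with a basic sanity check. This is a sketch, not part of the dataset tooling: `SucxInstance` and `validate_instance` are hypothetical names, and the check that all three lists have equal length assumes one POS tag and one NER tag per token.

```python
from typing import List, TypedDict


class SucxInstance(TypedDict):
    """Hypothetical type mirroring the four documented fields."""
    id: str
    tokens: List[str]
    pos_tags: List[str]
    ner_tags: List[str]


def validate_instance(instance: dict) -> List[str]:
    """Return a list of problems found; an empty list means the instance looks fine."""
    problems: List[str] = []
    if not isinstance(instance.get("id"), str):
        problems.append("id must be a string")

    lengths = {}
    for field in ("tokens", "pos_tags", "ner_tags"):
        value = instance.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of strings")
        else:
            lengths[field] = len(value)

    # Assumption: every token carries exactly one POS tag and one NER tag.
    if len(set(lengths.values())) > 1:
        problems.append(f"field lengths differ: {lengths}")
    return problems


good: SucxInstance = {
    "id": "example-0001",  # hypothetical id
    "tokens": ["vi", "kräver", "ersättning", "."],
    "pos_tags": ["PN", "VB", "NN", "MAD"],
    "ner_tags": ["O", "O", "O", "O"],
}
print(validate_instance(good))
```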

Data Splits

Split	Share of Total Dataset Size	Number of Instances (original_tags)
train	64%	46,026
dev	16%	11,506
test	20%	14,383

The simple_tags remix has fewer instances, because only sentences whose automatic and manual annotations match are included.
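
The percentages in the split table follow from the instance counts for the original tags, as this short check shows. The variable names are illustrative.

```python
# Instance counts for the original tags, taken from the table above.
split_sizes = {"train": 46026, "dev": 11506, "test": 14383}

total = sum(split_sizes.values())

# Share of each split, rounded to whole percent.
shares = {name: round(100 * count / total) for name, count in split_sizes.items()}

print(total)   # 71915
print(shares)  # {'train': 64, 'dev': 16, 'test': 20}
```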

Dataset Creation

See the original webpage

Additional Information

Dataset Curators

Språkbanken

Licensing Information

CC BY 4.0 (attribution)

Citation Information

SUC 2.0 manual

Contributions

Thanks to @robinqrtz for adding this dataset.

Author:

KBLab

Dataset size:

32.53 MB