Dataset:
clarin-pl/kpwr-ner
KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (Korpus Języka Polskiego Politechniki Wrocławskiej). Its objective is named entity recognition for fine-grained categories of entities. It is the ‘n82’ version of KPWr, which means that the number of classes is restricted to 82 (originally 120). During corpus creation, texts from various sources, covering many domains and genres, were annotated by humans.
Named entity recognition (NER) - tagging entities in text with their corresponding type.
Input ('tokens' column): sequence of tokens
Output ('ner' column): sequence of predicted token classes in BIO notation (82 possible classes, described in detail in the annotation guidelines)
Measurements: F1-score (seqeval)
Example:
Input: [‘Roboty’, ‘mają’, ‘kilkanaście’, ‘lat’, ‘i’, ‘pochodzą’, ‘z’, ‘USA’, ‘,’, ‘Wysokie’, ‘napięcie’, ‘jest’, ‘dużo’, ‘młodsze’, ‘,’, ‘powstało’, ‘w’, ‘Niemczech’, ‘.’]
Input (translated by DeepL): Robots are more than a dozen years old and come from the US, High Voltage is much younger, having been developed in Germany.
Output: [‘B-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘B-nam_loc_gpe_country’, ‘O’, ‘B-nam_pro_title’, ‘I-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘B-nam_loc_gpe_country’, ‘O’]
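The BIO output above can be collapsed into entity spans. The following is a minimal sketch, not part of the dataset tooling: the helper name `bio_to_spans` is hypothetical, and treating an `I-` tag without a preceding `B-` as the start of an entity is an assumption made here for leniency.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        # The open entity continues only on an I- tag of the same type
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # Assumption: a stray I- tag (no open entity) also starts an entity
        if tag != "O" and current_type is None:
            current_type, start = tag[2:], i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


tokens = ["Roboty", "mają", "kilkanaście", "lat", "i", "pochodzą", "z", "USA", ",",
          "Wysokie", "napięcie", "jest", "dużo", "młodsze", ",", "powstało", "w",
          "Niemczech", "."]
tags = ["B-nam_pro_title", "O", "O", "O", "O", "O", "O", "B-nam_loc_gpe_country", "O",
        "B-nam_pro_title", "I-nam_pro_title", "O", "O", "O", "O", "O", "O",
        "B-nam_loc_gpe_country", "O"]

spans = bio_to_spans(tags)
for entity_type, start, end in spans:
    print(entity_type, "->", " ".join(tokens[start:end]))
```

Here the two-token title "Wysokie napięcie" becomes a single span because its second token carries the matching `I-nam_pro_title` tag.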
Subset | Cardinality (sentences) |
---|---|
train | 13959 |
dev | 0 |
test | 4323 |
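Since the dev subset is empty, model selection would need a held-out part of train. A minimal sketch, assuming a random 10% hold-out is acceptable; the helper name `split_indices`, the fraction, and the seed are illustrative choices, not something the dataset prescribes.

```python
import random
from typing import List, Tuple


def split_indices(n_items: int, dev_fraction: float = 0.1,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, dev_indices) from a seeded random shuffle."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_dev = int(n_items * dev_fraction)
    # The first n_dev shuffled indices form the hold-out, the rest stay in train
    return sorted(indices[n_dev:]), sorted(indices[:n_dev])


# 13959 is the train cardinality from the table above
train_idx, dev_idx = split_indices(13959)
print(len(train_idx), len(dev_idx))
```

With the `datasets` library, the index lists could then be passed to `Dataset.select`. Note that a sentence-level random split may place sentences from the same source text on both sides.
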
Class | train | validation | test |
---|---|---|---|
B-nam_liv_person | 0.21910 | - | 0.21422 |
B-nam_loc_gpe_city | 0.10101 | - | 0.09865 |
B-nam_loc_gpe_country | 0.07467 | - | 0.08059 |
B-nam_org_institution | 0.05893 | - | 0.06005 |
B-nam_org_organization | 0.04448 | - | 0.05553 |
B-nam_org_group_team | 0.03492 | - | 0.03363 |
B-nam_adj_country | 0.03410 | - | 0.03747 |
B-nam_org_company | 0.02439 | - | 0.01716 |
B-nam_pro_media_periodic | 0.02250 | - | 0.01896 |
B-nam_fac_road | 0.01995 | - | 0.02144 |
B-nam_liv_god | 0.01934 | - | 0.00790 |
B-nam_org_nation | 0.01739 | - | 0.01828 |
B-nam_oth_tech | 0.01724 | - | 0.01377 |
B-nam_pro_media_web | 0.01709 | - | 0.00903 |
B-nam_fac_goe | 0.01596 | - | 0.01445 |
B-nam_eve_human | 0.01573 | - | 0.01761 |
B-nam_pro_title | 0.01558 | - | 0.00790 |
B-nam_pro_brand | 0.01543 | - | 0.01038 |
B-nam_org_political_party | 0.01264 | - | 0.01309 |
B-nam_loc_gpe_admin1 | 0.01219 | - | 0.01445 |
B-nam_eve_human_sport | 0.01174 | - | 0.01242 |
B-nam_pro_software | 0.01091 | - | 0.02190 |
B-nam_adj | 0.00963 | - | 0.01174 |
B-nam_loc_gpe_admin3 | 0.00888 | - | 0.01061 |
B-nam_pro_model_car | 0.00873 | - | 0.00587 |
B-nam_loc_hydronym_river | 0.00843 | - | 0.01151 |
B-nam_oth | 0.00775 | - | 0.00497 |
B-nam_pro_title_document | 0.00738 | - | 0.01986 |
B-nam_loc_astronomical | 0.00730 | - | - |
B-nam_oth_currency | 0.00723 | - | 0.01151 |
B-nam_adj_city | 0.00670 | - | 0.00948 |
B-nam_org_group_band | 0.00587 | - | 0.00429 |
B-nam_loc_gpe_admin2 | 0.00565 | - | 0.00813 |
B-nam_loc_gpe_district | 0.00504 | - | 0.00406 |
B-nam_loc_land_continent | 0.00459 | - | 0.00722 |
B-nam_loc_country_region | 0.00459 | - | 0.00090 |
B-nam_loc_land_mountain | 0.00414 | - | 0.00203 |
B-nam_pro_title_book | 0.00384 | - | 0.00248 |
B-nam_loc_historical_region | 0.00376 | - | 0.00497 |
B-nam_loc | 0.00361 | - | 0.00090 |
B-nam_eve | 0.00361 | - | 0.00181 |
B-nam_org_group | 0.00331 | - | 0.00406 |
B-nam_loc_land_island | 0.00331 | - | 0.00248 |
B-nam_pro_media_tv | 0.00316 | - | 0.00158 |
B-nam_liv_habitant | 0.00316 | - | 0.00158 |
B-nam_eve_human_cultural | 0.00316 | - | 0.00497 |
B-nam_pro_title_tv | 0.00309 | - | 0.00542 |
B-nam_oth_license | 0.00286 | - | 0.00248 |
B-nam_num_house | 0.00256 | - | 0.00248 |
B-nam_pro_title_treaty | 0.00248 | - | 0.00045 |
B-nam_fac_system | 0.00248 | - | 0.00587 |
B-nam_loc_gpe_subdivision | 0.00241 | - | 0.00587 |
B-nam_loc_land_region | 0.00226 | - | 0.00248 |
B-nam_pro_title_album | 0.00218 | - | 0.00158 |
B-nam_adj_person | 0.00203 | - | 0.00406 |
B-nam_fac_square | 0.00196 | - | 0.00135 |
B-nam_pro_award | 0.00188 | - | 0.00519 |
B-nam_eve_human_holiday | 0.00188 | - | 0.00203 |
B-nam_pro_title_song | 0.00166 | - | 0.00158 |
B-nam_pro_media_radio | 0.00151 | - | 0.00068 |
B-nam_pro_vehicle | 0.00151 | - | 0.00090 |
B-nam_oth_position | 0.00143 | - | 0.00226 |
B-nam_liv_animal | 0.00143 | - | 0.00248 |
B-nam_pro | 0.00135 | - | 0.00045 |
B-nam_oth_www | 0.00120 | - | 0.00451 |
B-nam_num_phone | 0.00120 | - | 0.00045 |
B-nam_pro_title_article | 0.00113 | - | - |
B-nam_oth_data_format | 0.00113 | - | 0.00226 |
B-nam_fac_bridge | 0.00105 | - | 0.00090 |
B-nam_liv_character | 0.00098 | - | - |
B-nam_pro_software_game | 0.00090 | - | 0.00068 |
B-nam_loc_hydronym_lake | 0.00090 | - | 0.00045 |
B-nam_loc_gpe_conurbation | 0.00090 | - | - |
B-nam_pro_media | 0.00083 | - | 0.00181 |
B-nam_loc_land | 0.00075 | - | 0.00045 |
B-nam_loc_land_peak | 0.00075 | - | - |
B-nam_fac_park | 0.00068 | - | 0.00226 |
B-nam_org_organization_sub | 0.00060 | - | 0.00068 |
B-nam_loc_hydronym | 0.00060 | - | 0.00023 |
B-nam_loc_hydronym_sea | 0.00045 | - | 0.00068 |
B-nam_loc_hydronym_ocean | 0.00045 | - | 0.00023 |
B-nam_fac_goe_stop | 0.00038 | - | 0.00090 |
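The train column above sums to roughly 1, which suggests that each value is the share of a class among all entity-initial (`B-`) tags in the split. Under that assumption, a distribution of this kind could be computed as sketched below; the helper name `entity_start_distribution` is hypothetical, and the single example sentence stands in for a whole split.

```python
from collections import Counter
from typing import Dict, List


def entity_start_distribution(sentences: List[List[str]]) -> Dict[str, float]:
    """Share of each B- tag among all B- tags, most frequent first."""
    counts = Counter(
        tag for tags in sentences for tag in tags if tag.startswith("B-")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.most_common()}


# One sentence from the example section, used here as a toy "split"
example_split = [
    ["B-nam_pro_title", "O", "O", "O", "O", "O", "O", "B-nam_loc_gpe_country", "O",
     "B-nam_pro_title", "I-nam_pro_title", "O", "O", "O", "O", "O", "O",
     "B-nam_loc_gpe_country", "O"],
]

distribution = entity_start_distribution(example_split)
for tag, share in distribution.items():
    print(f"{tag} | {share:.5f}")
```

`I-` and `O` tags are ignored, so every entity is counted once regardless of its length.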
```bibtex
@inproceedings{broda-etal-2012-kpwr,
    title = "{KPW}r: Towards a Free Corpus of {P}olish",
    author = "Broda, Bartosz and
      Marci{\'n}czuk, Micha{\l} and
      Maziarz, Marek and
      Radziszewski, Adam and
      Wardy{\'n}ski, Adam",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/965_Paper.pdf",
    pages = "3218--3222",
    abstract = "This paper presents our efforts aimed at collecting and annotating a free Polish corpus. The corpus will serve for us as training and testing material for experiments with Machine Learning algorithms. As others may also benefit from the resource, we are going to release it under a Creative Commons licence, which is hoped to remove unnecessary usage restrictions, but also to facilitate reproduction of our experimental results. The corpus is being annotated with various types of linguistic entities: chunks and named entities, selected syntactic and semantic relations, word senses and anaphora. We report on the current state of the project as well as our ultimate goals.",
}
```
Creative Commons Attribution 3.0 Unported Licence
KPWr annotation guidelines - named entities
```python
from pprint import pprint

from datasets import load_dataset

dataset = load_dataset("clarin-pl/kpwr-ner")
pprint(dataset['train'][0])

# {'lemmas': ['roborally', 'czy', 'wysoki', 'napięcie', '?'],
#  'ner': [73, 160, 73, 151, 160],
#  'orth': ['subst:sg:nom:n',
#           'qub',
#           'adj:sg:nom:n:pos',
#           'subst:sg:nom:n',
#           'interp'],
#  'tokens': ['RoboRally', 'czy', 'Wysokie', 'napięcie', '?']}
```
```python
import random
from pprint import pprint

# Note: `load_metric` has been deprecated in favour of the separate `evaluate`
# package; on recent `datasets` releases, `evaluate.load("seqeval")` is the
# replacement.
from datasets import load_dataset, load_metric

dataset = load_dataset("clarin-pl/kpwr-ner")
references = dataset["test"]["ner"]

# generate random predictions
predictions = [
    [
        random.randrange(dataset["train"].features["ner"].feature.num_classes)
        for _ in range(len(labels))
    ]
    for labels in references
]

# transform to original names of labels
references_named = [
    [dataset["train"].features["ner"].feature.names[label] for label in labels]
    for labels in references
]
predictions_named = [
    [dataset["train"].features["ner"].feature.names[label] for label in labels]
    for labels in predictions
]

# utilise seqeval to evaluate
seqeval = load_metric("seqeval")
seqeval_score = seqeval.compute(
    predictions=predictions_named, references=references_named, scheme="IOB2"
)

pprint(seqeval_score, depth=1)

# {'nam_adj': {...},
#  'nam_adj_city': {...},
#  'nam_adj_country': {...},
#  'nam_adj_person': {...},
#  'nam_eve': {...},
#  'nam_eve_human': {...},
#  'nam_eve_human_cultural': {...},
#  'nam_eve_human_holiday': {...},
#  'nam_eve_human_sport': {...},
#  'nam_fac_bridge': {...},
#  'nam_fac_goe': {...},
#  'nam_fac_goe_stop': {...},
#  'nam_fac_park': {...},
#  'nam_fac_road': {...},
#  'nam_fac_square': {...},
#  'nam_fac_system': {...},
#  'nam_liv_animal': {...},
#  'nam_liv_character': {...},
#  'nam_liv_god': {...},
#  'nam_liv_habitant': {...},
#  'nam_liv_person': {...},
#  'nam_loc': {...},
#  'nam_loc_astronomical': {...},
#  'nam_loc_country_region': {...},
#  'nam_loc_gpe_admin1': {...},
#  'nam_loc_gpe_admin2': {...},
#  'nam_loc_gpe_admin3': {...},
#  'nam_loc_gpe_city': {...},
#  'nam_loc_gpe_conurbation': {...},
#  'nam_loc_gpe_country': {...},
#  'nam_loc_gpe_district': {...},
#  'nam_loc_gpe_subdivision': {...},
#  'nam_loc_historical_region': {...},
#  'nam_loc_hydronym': {...},
#  'nam_loc_hydronym_lake': {...},
#  'nam_loc_hydronym_ocean': {...},
#  'nam_loc_hydronym_river': {...},
#  'nam_loc_hydronym_sea': {...},
#  'nam_loc_land': {...},
#  'nam_loc_land_continent': {...},
#  'nam_loc_land_island': {...},
#  'nam_loc_land_mountain': {...},
#  'nam_loc_land_peak': {...},
#  'nam_loc_land_region': {...},
#  'nam_num_house': {...},
#  'nam_num_phone': {...},
#  'nam_org_company': {...},
#  'nam_org_group': {...},
#  'nam_org_group_band': {...},
#  'nam_org_group_team': {...},
#  'nam_org_institution': {...},
#  'nam_org_nation': {...},
#  'nam_org_organization': {...},
#  'nam_org_organization_sub': {...},
#  'nam_org_political_party': {...},
#  'nam_oth': {...},
#  'nam_oth_currency': {...},
#  'nam_oth_data_format': {...},
#  'nam_oth_license': {...},
#  'nam_oth_position': {...},
#  'nam_oth_tech': {...},
#  'nam_oth_www': {...},
#  'nam_pro': {...},
#  'nam_pro_award': {...},
#  'nam_pro_brand': {...},
#  'nam_pro_media': {...},
#  'nam_pro_media_periodic': {...},
#  'nam_pro_media_radio': {...},
#  'nam_pro_media_tv': {...},
#  'nam_pro_media_web': {...},
#  'nam_pro_model_car': {...},
#  'nam_pro_software': {...},
#  'nam_pro_software_game': {...},
#  'nam_pro_title': {...},
#  'nam_pro_title_album': {...},
#  'nam_pro_title_article': {...},
#  'nam_pro_title_book': {...},
#  'nam_pro_title_document': {...},
#  'nam_pro_title_song': {...},
#  'nam_pro_title_treaty': {...},
#  'nam_pro_title_tv': {...},
#  'nam_pro_vehicle': {...},
#  'overall_accuracy': 0.006156203762418094,
#  'overall_f1': 0.0009844258777797407,
#  'overall_precision': 0.0005213624939842789,
#  'overall_recall': 0.008803611738148984}
```
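The overall scores above are entity-level, which is why random predictions land near zero: a predicted entity counts only if its type and both boundaries match. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not seqeval's implementation; the helper names `extract_spans` and `entity_f1` are hypothetical, and treating a stray `I-` tag as an entity start is an assumption.

```python
from typing import List, Set, Tuple

Span = Tuple[int, str, int, int]  # (sentence index, entity type, start, end)


def extract_spans(sentences: List[List[str]]) -> Set[Span]:
    """Collect entity spans from BIO-tagged sentences (end index exclusive)."""
    spans: Set[Span] = set()
    for s_idx, tags in enumerate(sentences):
        start, etype = None, None
        # A trailing "O" sentinel flushes an entity that reaches the sentence end
        for i, tag in enumerate(tags + ["O"]):
            inside = tag.startswith("I-") and tag[2:] == etype
            if etype is not None and not inside:
                spans.add((s_idx, etype, start, i))
                start, etype = None, None
            # Assumption: a stray I- tag (no open entity) also starts an entity
            if tag != "O" and etype is None:
                start, etype = i, tag[2:]
    return spans


def entity_f1(references: List[List[str]],
              predictions: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching entity spans."""
    gold, pred = extract_spans(references), extract_spans(predictions)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = [["B-nam_pro_title", "O", "O", "O", "O", "O", "O", "B-nam_loc_gpe_country",
              "O", "B-nam_pro_title", "I-nam_pro_title", "O", "O", "O", "O", "O", "O",
              "B-nam_loc_gpe_country", "O"]]
# Same as the reference, except that the two-token title is cut short
prediction = [reference[0][:10] + ["O"] + reference[0][11:]]

precision, recall, f1 = entity_f1(reference, prediction)
print(precision, recall, f1)
```

In this toy case only one token label differs, yet the truncated title counts as both a missed gold entity and a wrong prediction, so three of four entities match on each side.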