Dataset: ipipan/nkjp1m
License: cc-by-4.0
Source datasets: original
Annotations creators: expert-generated
Language creators: expert-generated
Size: 10K<n<100K
Multilinguality: monolingual
Language: pl
Task: token classification

This is the official dataset for NKJP1M, the 1-million-token balanced subcorpus of the National Corpus of Polish (Narodowy Korpus Języka Polskiego).
Besides the text (divided into paragraphs/samples and sentences), the set contains lemmas and morpho-syntactic tags for all tokens in the corpus.
This release, known as NKJP1M-SGJP, corresponds to version 1.2 of the corpus with later corrections and improvements. In particular, the morpho-syntactic annotation has been aligned with the present version of the Morfeusz2 SGJP morphological analyser (as of 2022.12.04).
The main use of this resource lies in training models for lemmatisation and part-of-speech tagging of Polish.
Polish (monolingual)
{'nkjp_text': 'NKJP_1M_1102000002', 'nkjp_par': 'morph_1-p', 'nkjp_sent': 'morph_1.18-s', 'tokens': ['-', 'Nie', 'mam', 'pieniędzy', ',', 'da', 'mi', 'pani', 'wywiad', '?'], 'lemmas': ['-', 'nie', 'mieć', 'pieniądz', ',', 'dać', 'ja', 'pani', 'wywiad', '?'], 'cposes': [8, 11, 10, 9, 8, 10, 9, 9, 9, 8], 'poses': [19, 25, 12, 35, 19, 12, 28, 35, 35, 19], 'tags': [266, 464, 213, 923, 266, 218, 692, 988, 961, 266], 'nps': [False, False, False, False, True, False, False, False, False, True], 'nkjp_ids': ['morph_1.9-seg', 'morph_1.10-seg', 'morph_1.11-seg', 'morph_1.12-seg', 'morph_1.13-seg', 'morph_1.14-seg', 'morph_1.15-seg', 'morph_1.16-seg', 'morph_1.17-seg', 'morph_1.18-seg']}
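The parallel list fields of an instance can be read together position by position. The sketch below reuses values from the sample instance above and shows two things: pairing tokens with their lemmas, and rebuilding the sentence text from `tokens` and `nps`. It assumes that `nps[i]` means "no space before token i", which is an interpretation of the sample rather than something the card states. The helper names `detokenize` and `token_lemma_pairs` are hypothetical and not part of the dataset.

```python
# Fields copied from the sample instance above (integer-coded columns omitted).
sample = {
    "tokens": ["-", "Nie", "mam", "pieniędzy", ",", "da", "mi", "pani", "wywiad", "?"],
    "lemmas": ["-", "nie", "mieć", "pieniądz", ",", "dać", "ja", "pani", "wywiad", "?"],
    "nps": [False, False, False, False, True, False, False, False, False, True],
}


def detokenize(tokens, nps):
    """Rebuild the sentence text, ASSUMING nps[i] means 'no space before token i'."""
    parts = []
    for i, (token, no_space) in enumerate(zip(tokens, nps)):
        # Insert a separating space unless this is the first token
        # or the token is flagged as attaching to the previous one.
        if i > 0 and not no_space:
            parts.append(" ")
        parts.append(token)
    return "".join(parts)


def token_lemma_pairs(instance):
    """Pair each token with its lemma; the two columns are parallel lists."""
    return list(zip(instance["tokens"], instance["lemmas"]))


text = detokenize(sample["tokens"], sample["nps"])
pairs = token_lemma_pairs(sample)
print(text)      # - Nie mam pieniędzy, da mi pani wywiad?
print(pairs[2])  # ('mam', 'mieć')
```

Under that reading of `nps`, the comma and the question mark attach to the preceding word and every other token is separated by a single space.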
| | Train | Validation | Test |
|---|---|---|---|
| sentences | 68943 | 7755 | 8964 |
| tokens | 978368 | 112454 | 125059 |
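
To get a feel for the split sizes, the counts in the table can be turned into relative shares and average sentence lengths. This is a small arithmetic sketch over the numbers given above; the dictionary layout and variable names are illustrative only.

```python
# Counts copied from the table above.
splits = {
    "train": {"sentences": 68943, "tokens": 978368},
    "validation": {"sentences": 7755, "tokens": 112454},
    "test": {"sentences": 8964, "tokens": 125059},
}

total_sentences = sum(s["sentences"] for s in splits.values())
total_tokens = sum(s["tokens"] for s in splits.values())

for name, counts in splits.items():
    share = counts["sentences"] / total_sentences       # fraction of all sentences
    avg_len = counts["tokens"] / counts["sentences"]    # mean tokens per sentence
    print(f"{name:<10} {share:6.1%} of sentences, {avg_len:.1f} tokens per sentence")

print(f"total: {total_sentences} sentences, {total_tokens} tokens")
```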
The National Corpus of Polish (NKJP) was envisioned as the reference corpus of contemporary Polish.
The manually annotated subcorpus (NKJP1M) was intended as training data for various NLP tasks.
NKJP is balanced with respect to Polish readership. The detailed rationale is described in Chapter 3 of the NKJP book (roughly: 50% press, 30% books, 10% speech, 10% other). The corpus contains texts from the years 1945–2010 (with 80% of the text in the range 1990–2010). Only original Polish texts were gathered (no translations from other languages). The composition of NKJP1M follows this schema (see Chapter 5).
The rules of morphosyntactic annotation used for NKJP are discussed in Chapter 6 of the NKJP book. Presently (2020), the corpus uses a common tagset with the morphological analyzer Morfeusz 2.
Annotation process

The texts were processed with Morfeusz and then the resulting annotations were manually disambiguated and validated/corrected. Each text sample was independently processed by two annotators. In case of annotation conflicts, an adjudicator stepped in.
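
The double-annotation workflow described above can be sketched as a merge step: where both annotators agree the tag is kept, and where they differ a third decision is requested. This is only an illustration of the idea, not the tooling actually used for NKJP. The function `merge_annotations`, the `adjudicate` callback and the example tags are all hypothetical.

```python
from typing import Callable, List

Tag = str  # hypothetical representation of one morphosyntactic tag


def merge_annotations(
    first: List[Tag],
    second: List[Tag],
    adjudicate: Callable[[int, Tag, Tag], Tag],
) -> List[Tag]:
    """Keep tags both annotators agree on; ask the adjudicator otherwise."""
    if len(first) != len(second):
        raise ValueError("both annotators must tag the same tokens")
    merged = []
    for i, (a, b) in enumerate(zip(first, second)):
        # Agreement: accept the shared tag. Conflict: defer to the adjudicator.
        merged.append(a if a == b else adjudicate(i, a, b))
    return merged


# Toy example with made-up tags; this adjudicator simply prefers the second annotator.
annotator_1 = ["subst:sg:nom:f", "fin:sg:ter:imperf", "interp"]
annotator_2 = ["subst:sg:nom:f", "fin:sg:ter:perf", "interp"]
final = merge_annotations(annotator_1, annotator_2, lambda i, a, b: b)
print(final)
```

In practice the adjudicator would be a human decision, so the callback here only stands in for that step.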
This work is licensed under a Creative Commons Attribution 4.0 International License.
Info on the source corpus: link
@Book{nkjp:12,
  editor = "Adam Przepiórkowski and Mirosław Bańko and Rafał L. Górski and Barbara Lewandowska-Tomaszczyk",
  title = "Narodowy Korpus Języka Polskiego",
  year = 2012,
  address = "Warszawa",
  pdf = "http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf",
  publisher = "Wydawnictwo Naukowe PWN"
}
Current annotation scheme: link
@article{kie:etal:21,
  author = "Kieraś, Witold and Woliński, Marcin and Nitoń, Bartłomiej",
  doi = "https://doi.org/10.31286/JP.101.2.5",
  title = "Nowe wielowarstwowe znakowanie lingwistyczne zrównoważonego {N}arodowego {K}orpusu {J}ęzyka {P}olskiego",
  url = "https://jezyk-polski.pl/index.php/jp/article/view/72",
  journal = "Język Polski",
  number = "2",
  volume = "CI",
  year = "2021",
  pages = "59--70"
}