Dataset:

ipipan/nkjp1m

Dataset Card for NKJP1M – The manually annotated subcorpus of the National Corpus of Polish

Dataset Summary

This is the official dataset for NKJP1M, the 1-million-token balanced subcorpus of the National Corpus of Polish (Narodowy Korpus Języka Polskiego).

Besides the text (divided into paragraphs/samples and sentences) the set contains lemmas and morpho-syntactic tags for all tokens in the corpus.

This release, known as NKJP1M-SGJP, corresponds to version 1.2 of the corpus with later corrections and improvements. In particular, the morpho-syntactic annotation has been aligned with the present version of the Morfeusz2 SGJP morphological analyser (as of 2022.12.04).

Supported Tasks and Leaderboards

The main use of this resource lies in training models for lemmatisation and part-of-speech tagging of Polish.

Languages

Polish (monolingual)

Dataset Structure

Data Instances

{'nkjp_text': 'NKJP_1M_1102000002',
 'nkjp_par': 'morph_1-p',
 'nkjp_sent': 'morph_1.18-s',
 'tokens': ['-', 'Nie', 'mam', 'pieniędzy', ',', 'da', 'mi', 'pani', 'wywiad', '?'],
 'lemmas': ['-', 'nie', 'mieć', 'pieniądz', ',', 'dać', 'ja', 'pani', 'wywiad', '?'],
 'cposes': [8, 11, 10, 9, 8, 10, 9, 9, 9, 8],
 'poses': [19, 25, 12, 35, 19, 12, 28, 35, 35, 19],
 'tags': [266, 464, 213, 923, 266, 218, 692, 988, 961, 266],
 'nps': [False, False, False, False, True, False, False, False, False, True],
 'nkjp_ids': ['morph_1.9-seg', 'morph_1.10-seg', 'morph_1.11-seg', 'morph_1.12-seg', 'morph_1.13-seg', 'morph_1.14-seg', 'morph_1.15-seg', 'morph_1.16-seg', 'morph_1.17-seg', 'morph_1.18-seg']}
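
The per-token fields in an instance are parallel sequences: position i in tokens, lemmas and nkjp_ids describes the same token. A minimal sketch of reading them side by side, using values copied from the instance above (the helper name token_rows is hypothetical and not part of the dataset):

```python
# Values copied from the sample instance; only three of the parallel fields are used.
example = {
    "tokens": ["-", "Nie", "mam", "pieniędzy", ",", "da", "mi", "pani", "wywiad", "?"],
    "lemmas": ["-", "nie", "mieć", "pieniądz", ",", "dać", "ja", "pani", "wywiad", "?"],
    "nkjp_ids": [
        "morph_1.9-seg", "morph_1.10-seg", "morph_1.11-seg", "morph_1.12-seg",
        "morph_1.13-seg", "morph_1.14-seg", "morph_1.15-seg", "morph_1.16-seg",
        "morph_1.17-seg", "morph_1.18-seg",
    ],
}


def token_rows(example):
    """Return (token, lemma, nkjp_id) triples, checking that the lists align."""
    n = len(example["tokens"])
    for key in ("lemmas", "nkjp_ids"):
        if len(example[key]) != n:
            raise ValueError(f"field {key!r} has {len(example[key])} items, expected {n}")
    return list(zip(example["tokens"], example["lemmas"], example["nkjp_ids"]))


for token, lemma, seg_id in token_rows(example):
    print(f"{seg_id}\t{token}\t{lemma}")
```

The same pattern extends to cposes, poses, tags and nps, which have one entry per token as well.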

Data Fields

  • nkjp_text, nkjp_par, nkjp_sent (strings): XML identifiers of the present text (document), paragraph and sentence in NKJP. (These make it possible to map a data point back to the source corpus and to identify paragraphs/samples.)
  • tokens (sequence of strings): tokens of the text defined as in NKJP.
  • lemmas (sequence of strings): lemmas corresponding to the tokens.
  • tags (sequence of labels): morpho-syntactic tags according to the Morfeusz2 tagset (1019 distinct tags).
  • poses (sequence of labels): flexemic class (detailed part of speech, 40 classes) – the first element of the corresponding tag.
  • cposes (sequence of labels): coarse part of speech (13 classes): all verbal and deverbal flexemic classes are mapped to V, nominal ones to N, adjectival ones to A, and “strange” ones (abbreviations, alien elements, symbols, emojis…) to X; the rest are as in poses.
  • nps (sequence of booleans): True means that the corresponding token is not preceded by a space in the source text.
  • nkjp_ids (sequence of strings): XML identifiers of particular tokens in NKJP (probably overkill).
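
Since nps marks tokens that are not preceded by a space, the surface form of a sentence can be rebuilt from tokens and nps. The following is a minimal sketch with values taken from the sample instance; the function name detokenize is hypothetical, and it assumes that a single space separates tokens wherever nps is False:

```python
def detokenize(tokens, nps):
    """Join tokens, inserting a space before each token whose nps flag is False."""
    if len(tokens) != len(nps):
        raise ValueError("tokens and nps must have the same length")
    parts = []
    for index, (token, no_preceding_space) in enumerate(zip(tokens, nps)):
        # Never emit a leading space before the first token of the sentence.
        if index > 0 and not no_preceding_space:
            parts.append(" ")
        parts.append(token)
    return "".join(parts)


tokens = ["-", "Nie", "mam", "pieniędzy", ",", "da", "mi", "pani", "wywiad", "?"]
nps = [False, False, False, False, True, False, False, False, False, True]

print(detokenize(tokens, nps))
```

The original whitespace may differ from a single space (for example at line breaks), so this reconstruction should be treated as an approximation of the source text.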

Data Splits

             Train     Validation   Test
sentences    68943     7755         8964
tokens       978368    112454       125059
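
For a quick sense of the proportions, the split counts above can be turned into token shares and average sentence lengths. This is only an illustrative calculation on the published counts; the dictionary layout is an assumption made for the example:

```python
# Counts copied from the Data Splits section.
splits = {
    "train": {"sentences": 68943, "tokens": 978368},
    "validation": {"sentences": 7755, "tokens": 112454},
    "test": {"sentences": 8964, "tokens": 125059},
}

total_sentences = sum(counts["sentences"] for counts in splits.values())
total_tokens = sum(counts["tokens"] for counts in splits.values())

for name, counts in splits.items():
    token_share = counts["tokens"] / total_tokens
    avg_length = counts["tokens"] / counts["sentences"]
    print(f"{name}: {token_share:.1%} of tokens, {avg_length:.1f} tokens per sentence")

print(f"total: {total_sentences} sentences, {total_tokens} tokens")
```

The training split holds roughly four fifths of the tokens, with the remainder divided between validation and test.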

Dataset Creation

Curation Rationale

The National Corpus of Polish (NKJP) was envisioned as the reference corpus of contemporary Polish.

The manually annotated subcorpus (NKJP1M) was intended as training data for various NLP tasks.

Source Data

NKJP is balanced with respect to Polish readership. The detailed rationale is described in Chapter 3 of the NKJP book (roughly: 50% press, 30% books, 10% speech, 10% other). The corpus contains texts from the years 1945–2010 (with 80% of the text in the range 1990–2010). Only original Polish texts were gathered (no translations from other languages). The composition of NKJP1M follows this schema (see Chapter 5).

Annotations

The rules of morphosyntactic annotation used for NKJP are discussed in Chapter 6 of the NKJP book. Presently (2020), the corpus uses a common tagset with the morphological analyzer Morfeusz 2.

Annotation process

The texts were processed with Morfeusz, and the resulting annotations were then manually disambiguated and validated/corrected. Each text sample was independently processed by two annotators. In case of annotation conflicts, an adjudicator stepped in.

Licensing Information

This work is licensed under a Creative Commons Attribution 4.0 International License.

Citation Information

Information on the source corpus:

@Book{nkjp:12,
  editor =       "Adam Przepiórkowski and Mirosław Bańko and Rafał
                  L. Górski and Barbara Lewandowska-Tomaszczyk",
  title =        "Narodowy Korpus Języka Polskiego",
  year =         2012,
  address =      "Warszawa",
  pdf =          "http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf",
  publisher =    "Wydawnictwo Naukowe PWN"}

Current annotation scheme:

@article{
    kie:etal:21,
    author = "Kieraś, Witold and Woliński, Marcin and Nitoń, Bartłomiej",
    doi = "https://doi.org/10.31286/JP.101.2.5",
    title = "Nowe wielowarstwowe znakowanie lingwistyczne zrównoważonego {N}arodowego {K}orpusu {J}ęzyka {P}olskiego",
    url = "https://jezyk-polski.pl/index.php/jp/article/view/72",
    journal = "Język Polski",
    number = "2",
    volume = "CI",
    year = "2021",
    pages = "59--70"
}