classla/setimes_sr | ATYUN.COM official site: a full-service platform for AI tutorials and news

Dataset:

classla/setimes_sr

Tasks:

task_categories:other

Subtasks:

lemmatization named-entity-recognition part-of-speech

Languages:

Other:

structure-prediction normalization tokenization

License:

cc-by-sa-4.0

The SETimes_sr training corpus contains 86,726 Serbian tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities and dependency syntax.

The dataset contains 3,177 training samples, 395 validation samples and 319 test samples. Each sample represents one sentence and includes the following features: sentence ID ('sent_id'), sentence text ('text'), list of tokens ('tokens'), list of lemmas ('lemmas'), list of MULTEXT-East tags ('xpos_tags'), list of UPOS tags ('upos_tags'), list of morphological features ('feats'), list of IOB tags ('iob_tags') and list of universal dependencies ('uds').
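To make the per-sentence layout concrete, here is a minimal sketch of how one sample might be handled once it is in memory as a plain dictionary. The sample values below are invented for illustration and are not taken from the corpus, the label spelling 'loc' is a placeholder, and the assumption that every listed feature holds one entry per token is ours, not something the description states. The helper names check_alignment and iob_to_spans are hypothetical.

```python
from typing import Dict, List, Tuple

# Feature names as listed in the dataset description.
TOKEN_LEVEL_FIELDS = [
    "tokens", "lemmas", "xpos_tags", "upos_tags", "feats", "iob_tags", "uds",
]


def check_alignment(sample: Dict[str, object]) -> int:
    """Return the token count, assuming each list feature has one entry per token."""
    n = len(sample["tokens"])
    for field in TOKEN_LEVEL_FIELDS:
        if field in sample and len(sample[field]) != n:
            raise ValueError(f"{field!r} has {len(sample[field])} entries, expected {n}")
    return n


def iob_to_spans(tokens: List[str], iob_tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (entity text, label) pairs from string IOB tags."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label = None
    for token, tag in zip(tokens, iob_tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and skip this token.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Invented sample with only some of the features filled in (illustrative values).
sample = {
    "sent_id": "example-1",  # hypothetical ID format
    "text": "Novi Sad je grad.",
    "tokens": ["Novi", "Sad", "je", "grad", "."],
    "lemmas": ["Novi", "Sad", "biti", "grad", "."],
    "upos_tags": ["PROPN", "PROPN", "AUX", "NOUN", "PUNCT"],
    "iob_tags": ["B-loc", "I-loc", "O", "O", "O"],
}

token_count = check_alignment(sample)
entities = iob_to_spans(sample["tokens"], sample["iob_tags"])
```

Note that this sketch works on string tags; in a configuration where the tags are encoded as class labels, they would first need to be mapped back to their names.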

Three dataset configurations are available, namely 'ner', 'upos', and 'ud', with the corresponding features encoded as class labels. If the configuration is not specified, it defaults to 'ner'.
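The configuration choice can be sketched as follows. This assumes the dataset is loaded through the Hugging Face `datasets` library under the ID shown on this page, which the page itself does not state; the helper names resolve_config and load_setimes_sr are hypothetical. Depending on the installed library version, datasets that rely on a loading script may need extra options or may not be supported, so treat the loader as a starting point rather than a guaranteed recipe.

```python
from typing import Optional

DATASET_ID = "classla/setimes_sr"
CONFIGS = ("ner", "upos", "ud")  # the three configurations named above
DEFAULT_CONFIG = "ner"           # used when no configuration is given


def resolve_config(name: Optional[str] = None) -> str:
    """Return the configuration to use, falling back to the documented default."""
    if name is None:
        return DEFAULT_CONFIG
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration {name!r}; expected one of {CONFIGS}")
    return name


def load_setimes_sr(name: Optional[str] = None):
    """Load the dataset with the chosen configuration (requires network access)."""
    # Imported lazily so resolve_config works without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, resolve_config(name))
```

Calling load_setimes_sr("upos") would request the configuration with UPOS tags encoded as class labels, while load_setimes_sr() falls back to 'ner'.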

If you use this dataset in your research, please cite the following paper:

@inproceedings{samardzic-etal-2017-universal,
    title = "{U}niversal {D}ependencies for {S}erbian in Comparison with {C}roatian and Other {S}lavic Languages",
    author = "Samard{\v{z}}i{\'c}, Tanja  and
      Starovi{\'c}, Mirjana  and
      Agi{\'c}, {\v{Z}}eljko  and
      Ljube{\v{s}}i{\'c}, Nikola",
    booktitle = "Proceedings of the 6th Workshop on {B}alto-{S}lavic Natural Language Processing",
    month = apr,
    year = "2017",
    address = "Valencia, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W17-1407",
    doi = "10.18653/v1/W17-1407",
    pages = "39--44",
}

Author:

classla

Dataset size:

1.72 MB