classla/reldi_sr

Dataset:

classla/reldi_sr

Tasks:

task_categories:other

Subtasks:

lemmatization named-entity-recognition part-of-speech

Languages:

Other:

structure-prediction normalization tokenization

License:

cc-by-sa-4.0

This dataset is based on 3,748 Serbian tweets that were segmented into sentences and tokens, and annotated with normalized forms, lemmas, MULTEXT-East tags (XPOS), UPOS tags, morphological features, and named entities.

The dataset contains 5,462 training samples (sentences), 711 validation samples, and 725 test samples. Each sample represents one sentence and includes the following features: sentence ID ('sent_id'), list of tokens ('tokens'), list of normalized tokens ('norms'), list of lemmas ('lemmas'), list of UPOS tags ('upos_tags'), list of MULTEXT-East tags ('xpos_tags'), list of morphological features ('feats'), and list of named entity IOB tags ('iob_tags'), which are encoded as class labels.
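The sample layout described above can be sketched in Python. This is a minimal illustration, not the maintainers' code: it assumes the dataset is published on the Hugging Face Hub under the ID `classla/reldi_sr` and loads with the `datasets` library's `load_dataset` using default arguments. The helper names `load_reldi_sr` and `token_records` are hypothetical, and the example sample uses made-up placeholder values rather than real data from the corpus.

```python
from typing import Dict, List

# Token-level feature names, as listed in the description above.
TOKEN_LEVEL_FIELDS = [
    "tokens", "norms", "lemmas", "upos_tags", "xpos_tags", "feats", "iob_tags",
]


def load_reldi_sr():
    """Load all splits from the Hugging Face Hub.

    Assumptions: the Hub ID is "classla/reldi_sr", the `datasets` package is
    installed, and network access is available. Depending on the installed
    `datasets` version, extra arguments may be required.
    """
    from datasets import load_dataset  # lazy import: the helper below works without it
    return load_dataset("classla/reldi_sr")


def token_records(sample: Dict[str, list]) -> List[Dict[str, object]]:
    """Turn one sentence-level sample into a list of per-token records."""
    lengths = {len(sample[field]) for field in TOKEN_LEVEL_FIELDS}
    if len(lengths) != 1:
        raise ValueError("token-level lists differ in length")
    n_tokens = lengths.pop()
    return [
        {field: sample[field][i] for field in TOKEN_LEVEL_FIELDS}
        for i in range(n_tokens)
    ]


# Made-up placeholder sample that only mirrors the schema; not real corpus data.
example_sample = {
    "sent_id": "example.1",
    "tokens": ["tok1", "tok2"],
    "norms": ["norm1", "norm2"],
    "lemmas": ["lemma1", "lemma2"],
    "upos_tags": ["X", "X"],
    "xpos_tags": ["Xf", "Xf"],
    "feats": ["_", "_"],
    "iob_tags": [0, 0],  # class-label integers; the real label inventory is not shown here
}

records = token_records(example_sample)
```

With a real split, `token_records(load_reldi_sr()["train"][0])` would produce the same kind of per-token view. Because the IOB tags are stored as class labels, the integers would still need to be mapped back to tag names through the dataset's feature metadata.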

If you are using this dataset in your research, please cite the following paper:

@article{Miličević_Ljubešić_2016,
title={Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets}, 
volume={4}, 
url={https://revije.ff.uni-lj.si/slovenscina2/article/view/7007}, 
DOI={10.4312/slo2.0.2016.2.156-188}, 
number={2}, 
journal={Slovenščina 2.0: empirical, applied and interdisciplinary research}, 
author={Miličević, Maja and Ljubešić, Nikola}, 
year={2016}, 
month={Sep.}, 
pages={156–188} }

Author:

classla

Dataset size:

808.53 KB