classla/ssj500k

Dataset:

classla/ssj500k

Tasks:

Token classification

Subtasks:

lemmatization named-entity-recognition parsing

Languages:

Other:

structure-prediction tokenization dependency-parsing

License:

cc-by-sa-4.0


The dataset contains 7,432 training samples, 1,164 validation samples, and 893 test samples. Each sample represents one sentence and includes the following features: sentence ID ('sent_id'), list of tokens ('tokens'), list of lemmas ('lemmas'), list of MULTEXT-East tags ('xpos_tags'), list of UPOS tags ('upos_tags'), list of morphological features ('feats'), list of IOB tags ('iob_tags'), and list of universal dependency tags ('uds'). Three dataset configurations are available, in each of which the corresponding feature is encoded as class labels: 'ner', 'upos', and 'ud'.
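
The sample layout and the idea of encoding tags as class labels can be sketched as follows. This is a minimal illustration, not the dataset's own loading code: it assumes the Hugging Face `datasets` package for the optional `load_ssj500k` helper, and the sample sentence, its tags, and the helper names (`load_ssj500k`, `build_label_map`, `encode_tags`) are hypothetical. The integer ids in the real configurations are defined by the dataset itself and may differ from the sorted order used here.

```python
from typing import Dict, List


def load_ssj500k(config: str = "ner"):
    """Load one configuration ('ner', 'upos' or 'ud') from the Hugging Face Hub.

    Assumption: the `datasets` package is installed and the Hub is reachable.
    Depending on the installed version, extra arguments may be required.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("classla/ssj500k", config)


def build_label_map(tag_sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct tag, in sorted order."""
    labels = sorted({tag for seq in tag_sequences for tag in seq})
    return {label: idx for idx, label in enumerate(labels)}


def encode_tags(tags: List[str], label_map: Dict[str, int]) -> List[int]:
    """Replace each string tag with its integer class id."""
    return [label_map[tag] for tag in tags]


# Made-up sample for illustration only; it is not taken from the corpus.
# It uses a subset of the feature names listed in the description.
sample = {
    "sent_id": "example.1",
    "tokens": ["Ana", "živi", "v", "Ljubljani", "."],
    "iob_tags": ["B-per", "O", "O", "B-loc", "O"],
}

label_map = build_label_map([sample["iob_tags"]])
encoded = encode_tags(sample["iob_tags"], label_map)

# Print each token next to its string tag and its integer class id.
for token, tag, tag_id in zip(sample["tokens"], sample["iob_tags"], encoded):
    print(f"{token}\t{tag}\t{tag_id}")
```

Each feature is a list aligned with 'tokens', so position i in 'iob_tags' (or any other tag list) describes token i. In a configuration such as 'ner', the tag list would already hold integer ids rather than strings.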

Author:

classla

Dataset size:

3.18 MB