Datasets:

ruanchaves/nru_hse

Languages:

ru

Multilinguality:

monolingual

Language creators:

machine-generated

Annotations creators:

expert-generated

Source datasets:

original

ArXiv:

arxiv:1911.03270

Dataset Card for NRU-HSE

Dataset Summary

This dataset contains real hashtags collected from several pages about civil services on vk.com (a Russian social network) and then segmented manually.

Languages

Russian

Dataset Structure

Data Instances

{
  "index": 0, 
  "hashtag": "ЁлкаВЗазеркалье",
  "segmentation": "Ёлка В Зазеркалье"
}

Data Fields

  • index : a numerical index.
  • hashtag : the original hashtag.
  • segmentation : the gold segmentation for the hashtag.
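As a minimal sketch of how these fields can be consumed, the snippet below splits the gold segmentation into words and derives one boundary label per character of the hashtag. The helper name boundary_labels and the 0/1 labeling scheme are illustrative assumptions, not part of the dataset or of the cited paper's code; the record is the instance shown under Data Instances.

```python
from typing import List


def boundary_labels(hashtag: str, segmentation: str) -> List[int]:
    """Return one label per character of `hashtag` (hypothetical helper).

    A label of 1 means whitespace follows that character in the gold
    segmentation; 0 means it does not.
    """
    labels: List[int] = []
    kept = []
    for ch in segmentation:
        if ch.isspace():
            # Mark the previous non-whitespace character as a word end.
            if labels:
                labels[-1] = 1
        else:
            kept.append(ch)
            labels.append(0)
    # The two fields are expected to differ only in whitespace.
    if "".join(kept) != hashtag:
        raise ValueError("segmentation differs from hashtag by more than whitespace")
    return labels


# The instance shown in the Data Instances section.
record = {
    "index": 0,
    "hashtag": "ЁлкаВЗазеркалье",
    "segmentation": "Ёлка В Зазеркалье",
}

words = record["segmentation"].split()
labels = boundary_labels(record["hashtag"], record["segmentation"])
print(words)
print(labels)
```

Per-character labels like these are one common way to frame segmentation as sequence labeling; other encodings would work equally well with the same two fields.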

Dataset Creation

  • All hashtag segmentation and identifier splitting datasets on this profile share the same basic fields: hashtag and segmentation, or identifier and segmentation.

  • The only difference between hashtag and segmentation (or between identifier and segmentation) is the whitespace characters. Spell checking, abbreviation expansion, and corrections to uppercase go into other fields.

  • There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _, :, ~).

  • If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.
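The conventions above can be checked mechanically. The sketch below is an assumption about how such a check might look, not code from the dataset or from hashformers: the function name follows_conventions is hypothetical, "special character" is taken to mean any character that is neither alphanumeric nor whitespace, and the foo_bar identifier is an invented example.

```python
def is_special(ch: str) -> bool:
    """Treat anything that is neither alphanumeric nor whitespace as special."""
    return not ch.isalnum() and not ch.isspace()


def follows_conventions(source: str, segmentation: str) -> bool:
    """Check a (hashtag or identifier, segmentation) pair (hypothetical helper).

    1. The two strings must be identical once whitespace is removed.
    2. An alphanumeric character must never directly touch a special one.
    """
    if "".join(source.split()) != "".join(segmentation.split()):
        return False
    for left, right in zip(segmentation, segmentation[1:]):
        if left.isalnum() and is_special(right):
            return False
        if is_special(left) and right.isalnum():
            return False
    return True


# The instance from this dataset card.
print(follows_conventions("ЁлкаВЗазеркалье", "Ёлка В Зазеркалье"))

# Invented identifier examples: the underscore must be set off by whitespace.
print(follows_conventions("foo_bar", "foo _ bar"))
print(follows_conventions("foo_bar", "foo_bar"))
```

Under these assumptions the first two calls return True and the third returns False, because the underscore touches alphanumeric characters on both sides.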

Additional Information

Citation Information

@article{glushkova2019char,
  title={Char-RNN and Active Learning for Hashtag Segmentation},
  author={Glushkova, Taisiya and Artemova, Ekaterina},
  journal={arXiv preprint arXiv:1911.03270},
  year={2019}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.