Dataset:

ruanchaves/dev_stanford

Source datasets:

original

Annotations creators:

expert-generated

Language creators:

machine-generated

Multilinguality:

monolingual

Languages:

en

Dataset Card for Dev-Stanford

Dataset Summary

A development set of 1,000 hashtags, randomly selected from the Stanford Sentiment Tweet Corpus by Sentiment140 and manually segmented by Çelebi et al.

Languages

English

Dataset Structure

Data Instances

{
    "index": 15,
    "hashtag": "marathonmonday",
    "segmentation": "marathon monday"
}

Data Fields

  • index: a numerical index.
  • hashtag: the original hashtag.
  • segmentation: the gold segmentation for the hashtag.
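
As an illustration of the three fields, the sketch below parses the instance shown above into a typed record. The `HashtagRecord` class and its `tokens` property are hypothetical names introduced here for illustration; they are not part of the dataset or of any library.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HashtagRecord:
    """Hypothetical container mirroring the three documented fields."""
    index: int          # numerical index
    hashtag: str        # the original hashtag
    segmentation: str   # the gold segmentation, words separated by whitespace

    @property
    def tokens(self) -> List[str]:
        # Split the gold segmentation on whitespace to get the individual words.
        return self.segmentation.split()


# The data instance from this card, serialized as JSON.
raw = '{"index": 15, "hashtag": "marathonmonday", "segmentation": "marathon monday"}'
record = HashtagRecord(**json.loads(raw))
print(record.tokens)
```

Keeping the segmentation as a single string and deriving the tokens on demand matches how the field is stored in the card.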

Dataset Creation

  • All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation, or identifier and segmentation.

  • The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, abbreviation expansion, and uppercase corrections go into other fields.

  • There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _, :, ~).

  • If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.
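
The two whitespace conventions listed above can be checked mechanically. The following is a minimal sketch under one reading of those rules; the function names are hypothetical, and the `snake_case` strings are illustrative inputs rather than entries from this dataset.

```python
def differs_only_by_whitespace(source: str, segmentation: str) -> bool:
    """True if removing all whitespace from both strings makes them equal."""
    return "".join(source.split()) == "".join(segmentation.split())


def special_chars_are_separated(segmentation: str) -> bool:
    """True if no alphanumeric character directly touches a special character.

    Assumption: a "special character" is any character that is neither
    alphanumeric nor whitespace (for example _, :, ~).
    """
    for left, right in zip(segmentation, segmentation[1:]):
        if left.isspace() or right.isspace():
            continue  # whitespace already separates this pair
        if left.isalnum() != right.isalnum():
            return False  # alphanumeric adjacent to a special character
    return True


# Instance from this card: only whitespace was inserted.
print(differs_only_by_whitespace("marathonmonday", "marathon monday"))

# Illustrative identifier-style inputs (not taken from the dataset).
print(special_chars_are_separated("snake _ case"))
print(special_chars_are_separated("snake_case"))
```

A check like this could be run over every row as a sanity test before training or evaluating a segmenter.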

Additional Information

Citation Information

@article{celebi2018segmenting,
  title={Segmenting hashtags and analyzing their grammatical structure},
  author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
  journal={Journal of the Association for Information Science and Technology},
  volume={69},
  number={5},
  pages={675--686},
  year={2018},
  publisher={Wiley Online Library}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.