Dataset:
ruanchaves/dev_stanford
License:
unknown
Source datasets:
original
Annotations creators:
expert-generated
Language creators:
machine-generated
Multilinguality:
monolingual
Language:
en
1000 hashtags manually segmented by Çelebi et al. for development purposes, randomly selected from the Stanford Sentiment Tweet Corpus by Sentiment140.
English
{ "index": 15, "hashtag": "marathonmonday", "segmentation": "marathon monday" }
All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation, or identifier and segmentation.
The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, expanding abbreviations, and correcting characters to uppercase go into other fields.
There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _, :, ~).
If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.
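The two formatting rules above can be checked mechanically. The following is a minimal sketch, not part of the dataset or of the hashformers library: the function names are hypothetical, and "special character" is assumed here to mean any character that is neither alphanumeric nor whitespace.

```python
def is_whitespace_only_difference(source: str, segmentation: str) -> bool:
    """Return True if the two strings are equal once all whitespace is removed."""
    def strip_whitespace(text: str) -> str:
        # str.split() with no arguments splits on any run of whitespace.
        return "".join(text.split())
    return strip_whitespace(source) == strip_whitespace(segmentation)


def has_spaced_special_characters(segmentation: str) -> bool:
    """Return True if no alphanumeric character directly touches a special character.

    Assumption: a "special" character is anything that is neither
    alphanumeric nor whitespace (for example _, :, ~).
    """
    def is_special(ch: str) -> bool:
        return not ch.isalnum() and not ch.isspace()

    # Inspect every pair of adjacent characters in the segmentation.
    for left, right in zip(segmentation, segmentation[1:]):
        if left.isalnum() and is_special(right):
            return False
        if is_special(left) and right.isalnum():
            return False
    return True


# The data instance shown earlier in this card.
record = {"index": 15, "hashtag": "marathonmonday", "segmentation": "marathon monday"}
print(is_whitespace_only_difference(record["hashtag"], record["segmentation"]))
print(has_spaced_special_characters(record["segmentation"]))
```

For an identifier splitting dataset, the same checks would be applied to the identifier field instead of hashtag. For instance, under the assumption above, a made-up identifier "my_var" would be expected to have the segmentation "my _ var" rather than "my_ var".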
@article{celebi2018segmenting,
  title={Segmenting hashtags and analyzing their grammatical structure},
  author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
  journal={Journal of the Association for Information Science and Technology},
  volume={69},
  number={5},
  pages={675--686},
  year={2018},
  publisher={Wiley Online Library}
}
This dataset was added by @ruanchaves while developing the hashformers library.