Dataset: strombergnlp/twitter_pos_vcb
Task: token classification
Sub-task: part-of-speech tagging
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: found
Annotation creators: machine-generated
Source datasets: original
License: cc-by-4.0

Part-of-speech information is a basic NLP task. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. This dataset is the vote-constrained bootstrapped data generated to support state-of-the-art results.
The data comprises about 1.5 million English tweets annotated for part-of-speech using Ritter's extension of the PTB tagset. The tweets are from 2012 and 2013, tokenized using the GATE tokenizer and tagged jointly using the CMU ARK tagger and Ritter's T-POS tagger. Only when both taggers' outputs are completely compatible over a whole tweet is that tweet added to the dataset.
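The vote-constraint step above can be sketched as follows. This is a minimal illustration, simplifying "completely compatible" to exact agreement between the two taggers' output sequences; the tag strings are illustrative, not taken from the dataset.

```python
# Sketch of the vote constraint: a tweet is kept only when both taggers
# produce matching tag sequences over the whole tweet.

def vote_constrained(tags_a, tags_b):
    """Return True if both taggers agree on every token of the tweet."""
    return len(tags_a) == len(tags_b) and all(
        a == b for a, b in zip(tags_a, tags_b)
    )

# Two taggers agreeing on a four-token tweet -> the tweet is kept.
ark_tags = ["USR", "VBZ", "DT", "NN"]
tpos_tags = ["USR", "VBZ", "DT", "NN"]
print(vote_constrained(ark_tags, tpos_tags))   # True

# A single disagreement anywhere drops the whole tweet.
print(vote_constrained(ark_tags, ["USR", "VBZ", "DT", "JJ"]))   # False
```

In the actual pipeline, "compatible" allows for mappings between the two taggers' tagsets rather than strict string equality, but the filtering principle is the same.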
This data is recommended for use as training data only, not as evaluation data.
For more details see https://gate.ac.uk/wiki/twitter-postagger.html and https://aclanthology.org/R13-1026.pdf
English, non-region-specific. bcp47:en
An example of 'train' looks as follows.
The data fields are the same among all splits.
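As a rough illustration of a record's shape: the field names ("tokens", "pos_tags") and the tag strings below are assumptions for this sketch, not guaranteed by the card.

```python
# Hypothetical single record: tokens paired one-to-one with POS tags
# from Ritter's extension of the PTB tagset (tag values illustrative).
example = {
    "tokens": ["@someuser", "lol", "that", "was", "funny"],
    "pos_tags": ["USR", "UH", "DT", "VBD", "JJ"],
}

# Tags align one-to-one with tokens.
assert len(example["tokens"]) == len(example["pos_tags"])
for tok, tag in zip(example["tokens"], example["pos_tags"]):
    print(f"{tok}\t{tag}")
```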
| name | tokens | sentences |
|---|---|---|
| twitter-pos-vcb | 1 543 126 | 159 492 |
Creative Commons Attribution 4.0 (CC-BY)
@inproceedings{derczynski2013twitter,
  title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data},
  author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  booktitle={Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013},
  pages={198--206},
  year={2013}
}
Author uploaded (@leondz).