Dataset: strombergnlp/twitter_pos
Tasks: Token Classification
Sub-tasks: part-of-speech
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License:
cc-by-4.0

Part-of-speech tagging is a basic NLP task. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. This dataset contains two datasets for English PoS tagging of tweets: Ritter (train, dev, and test splits) and Foster (dev and test splits).
The splits are those defined in Derczynski et al. (2013); the data itself comes from Ritter et al. (2011) and Foster et al. (2011).
English, non-region-specific. BCP 47 code: en
An example of 'train' looks as follows.
{'id': '0', 'tokens': ['Antick', 'Musings', 'post', ':', 'Book-A-Day', '2010', '#', '243', '(', '10/4', ')', '--', 'Gray', 'Horses', 'by', 'Hope', 'Larson', 'http://bit.ly/as8fvc'], 'pos_tags': [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23, 23, 51]}
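The data fields are the same among all splits. As the example shows, each record has an id (a string), tokens (a list of strings), and pos_tags (a list of integer tag IDs, one per token).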
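Because tokens and pos_tags are parallel lists, a record is only usable if the two have the same length. The following is a minimal sketch, not part of the original dataset tooling, that checks this alignment and groups tokens by tag ID. The record is copied from the 'train' example above; the helper names `pair_tokens_with_tags` and `group_by_tag` are hypothetical, and the integer IDs are left unmapped because the tag names are not listed in this card.

```python
from collections import defaultdict

# Record copied verbatim from the 'train' example in this card.
example = {
    "id": "0",
    "tokens": ["Antick", "Musings", "post", ":", "Book-A-Day", "2010", "#",
               "243", "(", "10/4", ")", "--", "Gray", "Horses", "by", "Hope",
               "Larson", "http://bit.ly/as8fvc"],
    "pos_tags": [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23,
                 23, 51],
}


def pair_tokens_with_tags(record):
    """Return (token, tag_id) pairs, failing loudly if the lists are misaligned."""
    tokens, tags = record["tokens"], record["pos_tags"]
    if len(tokens) != len(tags):
        raise ValueError(
            f"record {record['id']}: {len(tokens)} tokens but {len(tags)} tags"
        )
    return list(zip(tokens, tags))


def group_by_tag(record):
    """Map each integer tag ID to the tokens that carry it, in sentence order."""
    groups = defaultdict(list)
    for token, tag_id in pair_tokens_with_tags(record):
        groups[tag_id].append(token)
    return dict(groups)


pairs = pair_tokens_with_tags(example)
groups = group_by_tag(example)
print(pairs[:3])
print(groups[12])  # tokens sharing tag ID 12 in this example
```

Grouping by tag ID is a quick way to sanity-check the annotation by eye: in this record, tag ID 12 covers the numeric tokens and tag ID 23 covers the capitalised names.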
twitter-pos

name | tokens | sentences |
---|---|---|
ritter train | 10652 | 551 |
ritter dev | 2242 | 118 |
ritter test | 2291 | 118 |
foster dev | 2998 | 270 |
foster test | 2841 | 250 |
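The split sizes in the table can be aggregated per source to see how small each portion is. This is an illustrative sketch only: the numbers are copied from the table above, and the dictionary layout and the function name `totals_by_source` are assumptions made for the example.

```python
# (source, split) -> (tokens, sentences), copied from the table above.
split_sizes = {
    ("ritter", "train"): (10652, 551),
    ("ritter", "dev"): (2242, 118),
    ("ritter", "test"): (2291, 118),
    ("foster", "dev"): (2998, 270),
    ("foster", "test"): (2841, 250),
}


def totals_by_source(sizes):
    """Sum token and sentence counts over all splits of each source."""
    totals = {}
    for (source, _split), (tokens, sentences) in sizes.items():
        prev_tokens, prev_sentences = totals.get(source, (0, 0))
        totals[source] = (prev_tokens + tokens, prev_sentences + sentences)
    return totals


totals = totals_by_source(split_sizes)
for source, (tokens, sentences) in totals.items():
    # Average sentence length gives a rough feel for how the sources differ.
    print(f"{source}: {tokens} tokens, {sentences} sentences, "
          f"{tokens / sentences:.1f} tokens per sentence")
```

Summed this way, Ritter contributes 15185 tokens in 787 sentences and Foster 5839 tokens in 520 sentences. Only Ritter has a train split.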
@inproceedings{ritter2011named,
  title={Named entity recognition in tweets: an experimental study},
  author={Ritter, Alan and Clark, Sam and Etzioni, Oren and others},
  booktitle={Proceedings of the 2011 conference on empirical methods in natural language processing},
  pages={1524--1534},
  year={2011}
}

@inproceedings{foster2011hardtoparse,
  title={\# hardtoparse: POS Tagging and Parsing the Twitterverse},
  author={Foster, Jennifer and Cetinoglu, Ozlem and Wagner, Joachim and Le Roux, Joseph and Hogan, Stephen and Nivre, Joakim and Hogan, Deirdre and Van Genabith, Josef},
  booktitle={Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence},
  year={2011}
}

@inproceedings{derczynski2013twitter,
  title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data},
  author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  booktitle={Proceedings of the international conference recent advances in natural language processing ranlp 2013},
  pages={198--206},
  year={2013}
}
Author uploaded (@leondz)