Datasets:

strombergnlp/twitter_pos_vcb

Sub-tasks:

part-of-speech

Languages:

en

Multilinguality:

monolingual

Size Categories:

1M<n<10M

Language Creators:

found

Annotations Creators:

machine-generated

Source Datasets:

original

License:

cc-by-4.0

Dataset Card for "twitter-pos-vcb"

Dataset Summary

Part-of-speech tagging is a basic NLP task. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. This data is the vote-constrained bootstrapped data generated to support state-of-the-art results.

The data is about 1.5 million English tweets annotated for part-of-speech using Ritter's extension of the PTB tagset. The tweets are from 2012 and 2013, tokenized using the GATE tokenizer and tagged jointly using the CMU ARK tagger and Ritter's T-POS tagger. A tweet is added to the dataset only when the outputs of both taggers are completely compatible over the whole tweet.
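
The whole-tweet agreement rule described above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `COMPATIBLE` mapping between coarse tags and PTB-style tags is a hypothetical, deliberately tiny stand-in, and the function names `tweet_agrees` and `vote_constrained_filter` are invented for this example.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical, illustrative mapping from a coarse tag (first tagger) to the
# set of fine-grained tags (second tagger) it is considered compatible with.
# The real compatibility table is not given in this card.
COMPATIBLE: Dict[str, Set[str]] = {
    "N": {"NN", "NNS"},
    "^": {"NNP", "NNPS"},
    "V": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "@": {"USR"},
    "#": {"HT"},
    "U": {"URL"},
}

# One tweet: (tokens, coarse tags from tagger A, fine tags from tagger B).
TaggedTweet = Tuple[List[str], List[str], List[str]]


def tweet_agrees(
    coarse_tags: Sequence[str],
    fine_tags: Sequence[str],
    compatible: Dict[str, Set[str]] = COMPATIBLE,
) -> bool:
    """Return True only if every token's pair of tags is compatible."""
    if len(coarse_tags) != len(fine_tags):
        return False
    return all(
        fine in compatible.get(coarse, set())
        for coarse, fine in zip(coarse_tags, fine_tags)
    )


def vote_constrained_filter(
    tweets: Sequence[TaggedTweet],
    compatible: Dict[str, Set[str]] = COMPATIBLE,
) -> List[Tuple[List[str], List[str]]]:
    """Keep (tokens, fine_tags) for tweets where both taggers agree throughout."""
    kept = []
    for tokens, coarse, fine in tweets:
        if len(tokens) == len(coarse) and tweet_agrees(coarse, fine, compatible):
            kept.append((tokens, fine))
    return kept


if __name__ == "__main__":
    sample = [
        (["@bob", "likes", "cats"], ["@", "V", "N"], ["USR", "VBZ", "NNS"]),
        (["cats", "run"], ["N", "N"], ["NNS", "VBP"]),  # one disagreement: dropped
    ]
    print(vote_constrained_filter(sample))
```

A single incompatible token is enough to discard the whole tweet, which trades coverage for label precision. That trade-off fits the recommendation to use the data for training rather than evaluation.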

This data is recommended for use as training data only, and not as evaluation data.

For more details see https://gate.ac.uk/wiki/twitter-postagger.html and https://aclanthology.org/R13-1026.pdf

Supported Tasks and Leaderboards

More Information Needed

Languages

English, non-region-specific. bcp47:en

Dataset Structure

Data Instances

An example of 'train' looks as follows.

Data Fields

The data fields are the same among all splits.

twitter_pos_vcb
  • id: a string feature.
  • tokens: a list of string features.
  • pos_tags: a list of classification labels (int). Full tagset with indices:
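
The index-to-tag mapping can also be read from the feature metadata at load time. The sketch below assumes the Hugging Face `datasets` library is installed, that the dataset loads under the identifier `strombergnlp/twitter_pos_vcb` with a `train` split, and that `pos_tags` is stored as a sequence of `ClassLabel` values. The `toy_names` list is purely illustrative and is not the dataset's actual tag order.

```python
from typing import List, Sequence


def decode_tags(tag_ids: Sequence[int], label_names: Sequence[str]) -> List[str]:
    """Map integer class labels to tag strings using an index-ordered name list."""
    return [label_names[i] for i in tag_ids]


def load_label_names(dataset_id: str = "strombergnlp/twitter_pos_vcb") -> List[str]:
    """Fetch the tag names from the dataset's feature metadata (needs network access)."""
    # Imported lazily so decode_tags can be used without the library installed.
    from datasets import load_dataset

    train = load_dataset(dataset_id, split="train")
    # Assumes pos_tags is a sequence of ClassLabel values, which exposes .names.
    return list(train.features["pos_tags"].feature.names)


# Toy label list for illustration only; NOT the dataset's real index order.
toy_names = ["DT", "NN", "VBZ"]
print(decode_tags([0, 1, 2], toy_names))
```

With the real dataset, `decode_tags(example["pos_tags"], load_label_names())` would give the tag string for each token in `example["tokens"]`, provided the assumptions above hold.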

Data Splits

name | tokens | sentences
twitter-pos-vcb | 1,543,126 | 159,492

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Creative Commons Attribution 4.0 (CC-BY)

Citation Information

@inproceedings{derczynski2013twitter,
  title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data},
  author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  booktitle={Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013},
  pages={198--206},
  year={2013}
}

Contributions

Author uploaded (@leondz)