Dataset:
tner/tweetner7
This is the official repository of TweetNER7 ("Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts", AACL 2022 main conference), an NER dataset on Twitter with 7 entity labels. Each instance of TweetNER7 comes with a timestamp ranging from September 2019 to August 2021. The tweet collection used in TweetNER7 is the same as the one used in TweetTopic. The dataset is also integrated into TweetNLP.
We pre-process tweets before annotation to normalize some artifacts, converting URLs into the special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, a tweet
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}
A simple function to format a tweet is shown below.

    import re
    from urlextract import URLExtract

    extractor = URLExtract()


    def format_tweet(tweet):
        # mask web urls
        urls = extractor.find_urls(tweet)
        for url in urls:
            tweet = tweet.replace(url, "{{URL}}")
        # format twitter account
        tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
        return tweet


    target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
    target_format = format_tweet(target)
    print(target_format)
    # prints: Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}

We ask annotators to ignore those special tokens but label the verified users' mentions.
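
When working with the pre-processed text, it can be handy to spot the special tokens programmatically, for instance to list the verified mentions or to restore the plain `@name` form. The snippet below is a minimal sketch and not part of the official tooling: the helper names `verified_mentions` and `restore_mentions` are hypothetical, and the pattern assumes that verified mentions are always wrapped exactly as `{@name@}`, as in the example above.

```python
import re

# Matches a verified mention wrapped as {@name@}; the name itself is captured.
# Assumption: the name never contains curly braces.
VERIFIED_MENTION = re.compile(r"\{@([^{}]+?)@\}")


def verified_mentions(text):
    """Return the names found inside {@...@} wrappers, without the leading @."""
    return VERIFIED_MENTION.findall(text)


def restore_mentions(text):
    """Turn {@name@} back into @name; {{URL}} and {{USERNAME}} are left untouched."""
    return VERIFIED_MENTION.sub(r"@\1", text)


formatted = "Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}"
print(verified_mentions(formatted))
# ['herbiehancock', 'bluenoterecords']
print(restore_mentions(formatted))
# Album from @herbiehancock via @bluenoterecords link below: {{URL}}
```

The masked tokens {{URL}} and {{USERNAME}} cannot be restored this way, since the original URL or username is not kept in the text.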
split | number of instances | description |
---|---|---|
train_2020 | 4616 | training dataset from September 2019 to August 2020 |
train_2021 | 2495 | training dataset from September 2020 to August 2021 |
train_all | 7111 | combined training dataset of train_2020 and train_2021 |
validation_2020 | 576 | validation dataset from September 2019 to August 2020 |
validation_2021 | 310 | validation dataset from September 2020 to August 2021 |
test_2020 | 576 | test dataset from September 2019 to August 2020 |
test_2021 | 2807 | test dataset from September 2020 to August 2021 |
train_random | 4616 | randomly sampled training dataset with the same size as train_2020 from train_all |
validation_random | 576 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
extra_2020 | 87880 | extra tweets without annotations from September 2019 to August 2020 |
extra_2021 | 93594 | extra tweets without annotations from September 2020 to August 2021 |
For the temporal-shift setting, the model should be trained on train_2020 with validation_2020 and evaluated on test_2021. In the general setting, the model would be trained on train_all, the most representative training set, with validation_2021 and evaluated on test_2021.
An example from the train split looks as follows.
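
The two settings above can be written down as a small lookup. This is only a sketch: the setting names `temporal_shift` and `general` and the helpers `splits_for` and `load_setting` are hypothetical, the split sizes are copied from the table, and `load_setting` assumes the dataset loads through `datasets.load_dataset` under the ID `tner/tweetner7` (depending on the installed `datasets` version, extra arguments may be needed).

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "train_2020": 4616, "train_2021": 2495, "train_all": 7111,
    "validation_2020": 576, "validation_2021": 310,
    "test_2020": 576, "test_2021": 2807,
}

# Hypothetical setting names; the split assignments follow the text above.
SETTINGS = {
    "temporal_shift": {"train": "train_2020", "validation": "validation_2020", "test": "test_2021"},
    "general": {"train": "train_all", "validation": "validation_2021", "test": "test_2021"},
}


def splits_for(setting):
    """Return the train/validation/test split names for a setting."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting {setting!r}, expected one of {sorted(SETTINGS)}")
    return SETTINGS[setting]


def load_setting(setting):
    """Load the three splits of a setting (requires network access and `datasets`)."""
    from datasets import load_dataset  # imported lazily so the lookup works without it
    return {role: load_dataset("tner/tweetner7", split=name)
            for role, name in splits_for(setting).items()}


print(splits_for("temporal_shift"))
# {'train': 'train_2020', 'validation': 'validation_2020', 'test': 'test_2021'}

# train_all is the union of the two yearly training splits.
assert SPLIT_SIZES["train_all"] == SPLIT_SIZES["train_2020"] + SPLIT_SIZES["train_2021"]
```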

    {
        'tokens': ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer', 'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit', '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity', '{{URL}}'],
        'tags': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14, 4, 11, 11, 11, 11, 14],
        'id': '1183344337016381440',
        'date': '2019-10-13'
    }

The label2id dictionary can be found here.

    {
        "B-corporation": 0,
        "B-creative_work": 1,
        "B-event": 2,
        "B-group": 3,
        "B-location": 4,
        "B-person": 5,
        "B-product": 6,
        "I-corporation": 7,
        "I-creative_work": 8,
        "I-event": 9,
        "I-group": 10,
        "I-location": 11,
        "I-person": 12,
        "I-product": 13,
        "O": 14
    }

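
The integer tags can be mapped back to label names by inverting label2id, and consecutive B-/I- tags can then be grouped into entity spans. The following is a minimal sketch rather than the official decoding code: `decode_entities` is a hypothetical helper, it assumes the usual BIO reading of the labels, and an I- tag that does not continue an open span of the same type is simply ignored.

```python
LABEL2ID = {
    "B-corporation": 0, "B-creative_work": 1, "B-event": 2, "B-group": 3,
    "B-location": 4, "B-person": 5, "B-product": 6,
    "I-corporation": 7, "I-creative_work": 8, "I-event": 9, "I-group": 10,
    "I-location": 11, "I-person": 12, "I-product": 13, "O": 14,
}
ID2LABEL = {tag_id: label for label, tag_id in LABEL2ID.items()}


def decode_entities(tokens, tags):
    """Group BIO tag IDs into a list of (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tags):
        label = ID2LABEL[tag_id]
        # An I- tag of the same type extends the span that is currently open.
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
            continue
        # Any other label closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        # A B- tag opens a new span; O and stray I- tags open nothing.
        if label.startswith("B-"):
            current_type, current_tokens = label[2:], [token]
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# The train example shown above.
tokens = ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer',
          'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit',
          '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity',
          '{{URL}}']
tags = [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14,
        4, 11, 11, 11, 11, 14]

entities = decode_entities(tokens, tags)
print(entities)
# [('event', 'pinkoctober'), ('event', 'breastcancerawareness'),
#  ('location', 'Central Park , Desa Parkcity')]
```

Tokens are joined with single spaces, so punctuation inside a span (the comma in the location above) stays as its own token in the output.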
See the full evaluation metrics here.
Model description follows below.
Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
---|---|---|---|---|
tner/roberta-large-tweetner7-random | tweetner7 | roberta-large | 66.33 | 60.96 |
tner/twitter-roberta-base-2019-90m-tweetner7-random | tweetner7 | cardiffnlp/twitter-roberta-base-2019-90m | 63.29 | 58.5 |
tner/roberta-base-tweetner7-random | tweetner7 | roberta-base | 64.04 | 59.23 |
tner/twitter-roberta-base-dec2020-tweetner7-random | tweetner7 | cardiffnlp/twitter-roberta-base-dec2020 | 64.72 | 59.97 |
tner/bertweet-large-tweetner7-random | tweetner7 | vinai/bertweet-large | 64.86 | 60.49 |
tner/bertweet-base-tweetner7-random | tweetner7 | vinai/bertweet-base | 65.55 | 59.58 |
tner/bert-large-tweetner7-random | tweetner7 | bert-large | 62.39 | 57.54 |
tner/bert-base-tweetner7-random | tweetner7 | bert-base | 60.91 | 55.92 |
Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
---|---|---|---|---|
tner/roberta-large-tweetner7-selflabel2020 | tweetner7 | roberta-large | 64.56 | 59.63 |
tner/roberta-large-tweetner7-selflabel2021 | tweetner7 | roberta-large | 64.6 | 59.45 |
tner/roberta-large-tweetner7-2020-selflabel2020-all | tweetner7 | roberta-large | 65.46 | 60.39 |
tner/roberta-large-tweetner7-2020-selflabel2021-all | tweetner7 | roberta-large | 64.52 | 59.45 |
tner/roberta-large-tweetner7-selflabel2020-continuous | tweetner7 | roberta-large | 65.15 | 60.23 |
tner/roberta-large-tweetner7-selflabel2021-continuous | tweetner7 | roberta-large | 64.48 | 59.41 |
The repository https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper contains the code to reproduce the experimental results in our AACL paper.

    @inproceedings{ushio-etal-2022-tweet,
        title = "{N}amed {E}ntity {R}ecognition in {T}witter: {A} {D}ataset and {A}nalysis on {S}hort-{T}erm {T}emporal {S}hifts",
        author = "Ushio, Asahi and Neves, Leonardo and Silva, Vitor and Barbieri, Francesco and Camacho-Collados, Jose",
        booktitle = "The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
        month = nov,
        year = "2022",
        address = "Online",
        publisher = "Association for Computational Linguistics",
    }