[roberta-large-ner-english] is an English NER model fine-tuned from roberta-large on the conll2003 dataset. The model was validated on email/chat data and outperformed other models on this type of data specifically. In particular, the model seems to work better on entities that don't start with an uppercase letter.
Training data was classified as follows:
Abbreviation | Description |
---|---|
O | Outside of a named entity |
MISC | Miscellaneous entity |
PER | Person’s name |
ORG | Organization |
LOC | Location |
To simplify, the B- and I- prefixes from the original conll2003 tags were removed. I used the train and test sets from the original conll2003 for training and the "validation" set for validation. This resulted in a dataset of size:
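
The prefix removal described above can be sketched as follows. This is a minimal illustration, not the exact preprocessing used for training: the helper name `strip_bio_prefix` and the example tag sequence are hypothetical.

```python
def strip_bio_prefix(tag: str) -> str:
    """Map 'B-PER' / 'I-PER' to 'PER'; leave 'O' and unprefixed tags unchanged."""
    if tag.startswith(("B-", "I-")):
        return tag[2:]
    return tag


# Hypothetical tag sequence in the original conll2003 scheme
original_tags = ["B-ORG", "O", "B-PER", "I-PER", "O", "B-MISC"]
simplified_tags = [strip_bio_prefix(tag) for tag in original_tags]
print(simplified_tags)  # ['ORG', 'O', 'PER', 'PER', 'O', 'MISC']
```

One consequence of this simplification is that two adjacent entities of the same type can no longer be told apart from the tags alone.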
Train | Validation |
---|---|
17494 | 3250 |

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")


##### Process text sample (from wikipedia)

from transformers import pipeline

nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")


[{'entity_group': 'ORG',
  'score': 0.99381506,
  'word': ' Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.99970853,
  'word': ' Steve Jobs',
  'start': 29,
  'end': 39},
 {'entity_group': 'PER',
  'score': 0.99981767,
  'word': ' Steve Wozniak',
  'start': 41,
  'end': 54},
 {'entity_group': 'PER',
  'score': 0.99956465,
  'word': ' Ronald Wayne',
  'start': 59,
  'end': 71},
 {'entity_group': 'PER',
  'score': 0.9997918,
  'word': ' Wozniak',
  'start': 92,
  'end': 99},
 {'entity_group': 'MISC',
  'score': 0.99956393,
  'word': ' Apple I',
  'start': 102,
  'end': 109}]
```

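
The `word` values in the output above carry a leading space. A small post-processing step can tidy this up. The sketch below is only a suggestion: `clean_entities` and its `min_score` parameter are hypothetical names, not part of transformers, and the sample list is copied from the output above so the snippet runs without downloading the model.

```python
def clean_entities(entities, min_score=0.0):
    """Strip surrounding whitespace from each 'word' and drop low-confidence entities."""
    cleaned = []
    for entity in entities:
        if entity["score"] < min_score:
            continue
        cleaned.append({**entity, "word": entity["word"].strip()})
    return cleaned


# First two entities from the pipeline output shown above
sample_output = [
    {"entity_group": "ORG", "score": 0.99381506, "word": " Apple", "start": 0, "end": 5},
    {"entity_group": "PER", "score": 0.99970853, "word": " Steve Jobs", "start": 29, "end": 39},
]

for entity in clean_entities(sample_output, min_score=0.5):
    print(entity["entity_group"], entity["word"])
```

The `start` and `end` offsets index into the input string, so slicing the original text with them is another way to recover the entity text without the extra space.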
Model performance on the conll2003 validation dataset (computed on token predictions):
entity | precision | recall | f1 |
---|---|---|---|
PER | 0.9914 | 0.9927 | 0.9920 |
ORG | 0.9627 | 0.9661 | 0.9644 |
LOC | 0.9795 | 0.9862 | 0.9828 |
MISC | 0.9292 | 0.9262 | 0.9277 |
Overall | 0.9740 | 0.9766 | 0.9753 |
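
For readers who want to compute comparable per-entity figures on their own data, here is a minimal sketch of token-level precision, recall and F1. It is an assumption about how such scores can be computed, not the evaluation script used for the table above; `per_label_scores` and the toy labels are hypothetical, and the "O" label is ignored when counting.

```python
from collections import Counter


def per_label_scores(true_labels, pred_labels, ignore="O"):
    """Return {label: (precision, recall, f1)} computed token by token."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(true_labels, pred_labels):
        if true == pred:
            if true != ignore:
                tp[true] += 1
        else:
            if pred != ignore:
                fp[pred] += 1  # predicted an entity type that is wrong here
            if true != ignore:
                fn[true] += 1  # missed the true entity type
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy example: one missed PER token and one spurious ORG token
y_true = ["PER", "PER", "O", "ORG", "O", "LOC"]
y_pred = ["PER", "O", "O", "ORG", "ORG", "LOC"]
for label, (p, r, f) in per_label_scores(y_true, y_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```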
On a private dataset (email, chat, informal discussion), computed on word predictions:
entity | precision | recall | f1 |
---|---|---|---|
PER | 0.8823 | 0.9116 | 0.8967 |
ORG | 0.7694 | 0.7292 | 0.7487 |
LOC | 0.8619 | 0.7768 | 0.8171 |
By comparison, on the same private dataset, spaCy (en_core_web_trf-3.2.0) gave:
entity | precision | recall | f1 |
---|---|---|---|
PER | 0.9146 | 0.8287 | 0.8695 |
ORG | 0.7655 | 0.6437 | 0.6993 |
LOC | 0.8727 | 0.6180 | 0.7236 |
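
To make the comparison easier to read, the F1 gap per entity can be derived from the precision and recall values in the two tables above. This is a small worked sketch: the dictionary names are hypothetical, and because the inputs are already rounded to four decimals, the recomputed F1 may differ from the tables in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall) pairs copied from the two private-dataset tables
roberta_ner = {"PER": (0.8823, 0.9116), "ORG": (0.7694, 0.7292), "LOC": (0.8619, 0.7768)}
spacy_trf = {"PER": (0.9146, 0.8287), "ORG": (0.7655, 0.6437), "LOC": (0.8727, 0.6180)}

f1_gaps = {}
for label in roberta_ner:
    ours = f1_score(*roberta_ner[label])
    theirs = f1_score(*spacy_trf[label])
    f1_gaps[label] = ours - theirs
    print(f"{label}: roberta={ours:.4f} spacy={theirs:.4f} gap={f1_gaps[label]:+.4f}")
```

In both tables the gap comes mainly from recall: spaCy has equal or higher precision on PER and LOC, but lower recall on all three entity types.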
For those who may be interested, here is a short article on how I used the results of this model to train an LSTM model for signature detection in emails: https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa