We present DistilCamemBERT-NER, a DistilCamemBERT model fine-tuned for the Named Entity Recognition (NER) task in French. The work is inspired by Jean-Baptiste/camembert-ner, which is based on the CamemBERT model. The problem with CamemBERT-based models arises at scaling time, for example in the production phase: inference cost can become a technological issue. To counteract this effect, we propose this model, which halves the inference time at the same power consumption, thanks to DistilCamemBERT.
The dataset used is wikiner_fr, which contains ~170k sentences labeled in 5 categories: PER (person), LOC (location), ORG (organization), MISC (miscellaneous entities), and O (outside any entity). Evaluation results on this dataset are given in the table below (a reproduction sketch follows it):
class | precision (%) | recall (%) | f1 (%) | support (# sub-words) |
---|---|---|---|---|
global | 98.17 | 98.19 | 98.18 | 378,776 |
PER | 96.78 | 96.87 | 96.82 | 23,754 |
LOC | 94.05 | 93.59 | 93.82 | 27,196 |
ORG | 86.05 | 85.92 | 85.98 | 6,526 |
MISC | 88.78 | 84.69 | 86.69 | 11,891 |
O | 99.26 | 99.47 | 99.37 | 309,409 |
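The card does not include the evaluation script; the sketch below shows one plausible way to reproduce per-class, sub-word-level scores. The Hub dataset id `Jean-Baptiste/wikiner_fr`, the `test` split, the column names, and the 100-sentence sample are assumptions, not part of the original card.

```python
# Minimal sketch (not the authors' evaluation script): score the model at
# the sub-word level against wikiner_fr with scikit-learn.
from datasets import load_dataset
from sklearn.metrics import classification_report
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

MODEL = "cmarkea/distilcamembert-base-ner"
# Assumed Hub id and split; adjust if the dataset is hosted elsewhere.
dataset = load_dataset("Jean-Baptiste/wikiner_fr", split="test")
label_names = dataset.features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL)
model.eval()

y_true, y_pred = [], []
for example in dataset.select(range(100)):  # small sample for illustration
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(**enc).logits[0].argmax(dim=-1).tolist()
    for idx, word_id in enumerate(enc.word_ids()):
        if word_id is None:  # skip special tokens
            continue
        # Note: model and dataset label names may differ; map them if needed.
        y_true.append(label_names[example["ner_tags"][word_id]])
        y_pred.append(model.config.id2label[pred_ids[idx]])

print(classification_report(y_true, y_pred, digits=4))
```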
The model's performance is compared to two reference models (see below) using the f1 score. Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3 GHz with 6 cores (a timing sketch follows the table):
model | time (ms) | PER (%) | LOC (%) | ORG (%) | MISC (%) | O (%) |
---|---|---|---|---|---|---|
cmarkea/distilcamembert-base-ner | 43.44 | 96.82 | 93.82 | 85.98 | 86.69 | 99.37 |
Davlan/bert-base-multilingual-cased-ner-hrl | 87.56 | 79.93 | 72.89 | 61.34 | n/a | 96.04 |
flair/ner-french | 314.96 | 82.91 | 76.17 | 70.96 | 76.29 | 97.65 |
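The benchmarking script is not part of the card; below is a minimal sketch of how such a mean per-sentence CPU inference time could be measured. The example sentence, the number of repetitions, and the warm-up policy are assumptions.

```python
# Minimal timing sketch (not the authors' benchmark): mean per-sentence
# CPU inference time of the token-classification pipeline.
import time
from transformers import pipeline

ner = pipeline(
    task="ner",
    model="cmarkea/distilcamembert-base-ner",
    aggregation_strategy="simple",
    device=-1,  # force CPU
)

sentence = "Louis Lichou a présidé le CMB, une banque située en Bretagne."
ner(sentence)  # warm-up run, excluded from timing

n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    ner(sentence)
mean_ms = (time.perf_counter() - start) / n_runs * 1e3
print(f"mean inference time: {mean_ms:.2f} ms")
```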
Usage with the transformers pipeline:

```python
from transformers import pipeline

ner = pipeline(
    task='ner',
    model="cmarkea/distilcamembert-base-ner",
    tokenizer="cmarkea/distilcamembert-base-ner",
    aggregation_strategy="simple"
)
result = ner(
    "Le Crédit Mutuel Arkéa est une banque Française, elle comprend le CMB "
    "qui est une banque située en Bretagne et le CMSO qui est une banque "
    "qui se situe principalement en Aquitaine. C'est sous la présidence de "
    "Louis Lichou, dans les années 1980 que différentes filiales sont créées "
    "au sein du CMB et forment les principales filiales du groupe qui "
    "existent encore aujourd'hui (Federal Finance, Suravenir, Financo, etc.)."
)

result
[{'entity_group': 'ORG', 'score': 0.9974479, 'word': 'Crédit Mutuel Arkéa', 'start': 3, 'end': 22},
 {'entity_group': 'LOC', 'score': 0.9000358, 'word': 'Française', 'start': 38, 'end': 47},
 {'entity_group': 'ORG', 'score': 0.9788757, 'word': 'CMB', 'start': 66, 'end': 69},
 {'entity_group': 'LOC', 'score': 0.99919766, 'word': 'Bretagne', 'start': 99, 'end': 107},
 {'entity_group': 'ORG', 'score': 0.9594884, 'word': 'CMSO', 'start': 114, 'end': 118},
 {'entity_group': 'LOC', 'score': 0.99935514, 'word': 'Aquitaine', 'start': 169, 'end': 178},
 {'entity_group': 'PER', 'score': 0.99911094, 'word': 'Louis Lichou', 'start': 208, 'end': 220},
 {'entity_group': 'ORG', 'score': 0.96226394, 'word': 'CMB', 'start': 291, 'end': 294},
 {'entity_group': 'ORG', 'score': 0.9983959, 'word': 'Federal Finance', 'start': 374, 'end': 389},
 {'entity_group': 'ORG', 'score': 0.9984454, 'word': 'Suravenir', 'start': 391, 'end': 400},
 {'entity_group': 'ORG', 'score': 0.9985084, 'word': 'Financo', 'start': 402, 'end': 409}]
```
The model can also be run with the ONNX backend via Optimum (note the hub id points to the NER model, fixing the `-nli` copy-paste slip):

```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-ner"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForTokenClassification.from_pretrained(HUB_MODEL)
onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model
quantized_model = ORTModelForTokenClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
```
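The quantized model can then be served through the same pipeline API; a short sketch (the example sentence is an assumption):

```python
# Run the quantized ONNX model through the same transformers pipeline.
quantized_ner = pipeline(
    "token-classification",
    model=quantized_model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
quantized_ner("Louis Lichou a dirigé le CMB, une banque située en Bretagne.")
```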
Citation:

```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```