DistilCamemBERT-Sentiment

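We present DistilCamemBERT-Sentiment, a DistilCamemBERT model fine-tuned for sentiment analysis in French. The model is trained on two datasets, Amazon Reviews and Allociné.fr, to reduce bias: Amazon reviews tend to be short and similar to one another, whereas Allociné reviews are long, rich texts.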

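This model is close to tblard/tf-allocine, which is based on CamemBERT. The difficulty with CamemBERT-based models appears when scaling, for example in the production phase, where inference cost can become a technical constraint. To counter this, we propose this model, which, thanks to DistilCamemBERT, halves the inference time at the same power consumption.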

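Dataset

The dataset comprises 204,993 training reviews and 4,999 test reviews from Amazon, plus 235,516 training reviews and 4,729 test reviews from the Allociné website. It is labeled into five categories:

1 star: a terrible review,
2 stars: a bad review,
3 stars: a neutral review,
4 stars: a good review,
5 stars: an excellent review.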

Evaluation results

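In addition to accuracy (called exact accuracy here), and in order to be robust to +/-1 star estimation errors, we use the following definition as a performance measure:

$$\text{top-2 acc} = \frac{1}{|\mathcal{O}|} \sum_{i \in \mathcal{O}} \sum_{0 \leq l < 2} \mathbb{1}(\hat{f}_{i,l} = y_i)$$

where $\hat{f}_{i,l}$ is the $l$-th predicted label for observation $i$ (ranked by decreasing score, so $l = 0$ is the best prediction), $y_i$ is its true label, $\mathcal{O}$ is the set of test observations, and $\mathbb{1}$ is the indicator function.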
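
The top-2 accuracy defined above can be sketched in a few lines of plain Python. This is only an illustration on made-up scores, not the evaluation code used for the table below; the function name `top_k_accuracy` and the toy data are assumptions introduced here.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], y_true: Sequence[int], k: int = 2
) -> float:
    """Fraction of observations whose true label is among the k best-scored classes."""
    hits = 0
    for row, label in zip(scores, y_true):
        # Class indices sorted by decreasing score; keep the k best.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# Toy data (hypothetical): 3 observations, 5 classes,
# index 0 = "1 star", ..., index 4 = "5 stars".
toy_scores = [
    [0.05, 0.14, 0.36, 0.32, 0.13],  # two best classes: 2, then 3
    [0.70, 0.20, 0.05, 0.03, 0.02],  # two best classes: 0, then 1
    [0.10, 0.10, 0.10, 0.30, 0.40],  # two best classes: 4, then 3
]
toy_labels = [3, 1, 0]

exact_acc = top_k_accuracy(toy_scores, toy_labels, k=1)  # only the best prediction counts
top2_acc = top_k_accuracy(toy_scores, toy_labels, k=2)   # best or second-best counts
```

With `k=1` the function reduces to the exact accuracy, which makes the two columns of the results table directly comparable.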

class	exact accuracy (%)	top-2 acc (%)	support
global	61.01	88.80	9,698
1 star	87.21	77.17	1,905
2 stars	79.19	84.75	1,935
3 stars	77.85	78.98	1,974
4 stars	78.61	90.22	1,952
5 stars	85.96	82.92	1,932

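Benchmark

This model is compared with 3 reference models (see below). Since the models do not all share the same target definition, we detail the performance measure used for each comparison. Mean inference time was measured on a machine with an AMD Ryzen 5 4500U @ 2.3GHz with 6 cores.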
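
The timing procedure itself is not described here. As a rough sketch of how a mean inference time per call could be measured, assuming a simple wall-clock loop with a few warm-up calls, one might write the following; `mean_latency_ms` and `predict` are hypothetical names, and `predict` stands for any callable that takes a text, such as a `transformers` pipeline.

```python
import statistics
import time
from typing import Callable, Sequence


def mean_latency_ms(
    predict: Callable[[str], object], texts: Sequence[str], warmup: int = 2
) -> float:
    """Return the mean wall-clock time per call to `predict`, in milliseconds."""
    # Warm-up calls are not timed, so one-off costs (lazy initialisation,
    # caches) do not skew the mean. This choice is an assumption of the sketch.
    for text in texts[:warmup]:
        predict(text)

    durations = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        durations.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(durations)
```

Results from such a loop depend heavily on hardware, batch size, and text length, so they are only comparable between models measured under the same conditions.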

bert-base-multilingual-uncased-sentiment

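nlptown/bert-base-multilingual-uncased-sentiment is based on the multilingual, uncased version of BERT. This sentiment analyzer is trained on Amazon reviews, like our model, so the targets and their definitions are the same.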

model	time (ms)	exact accuracy (%)	top-2 acc (%)
cmarkea/distilcamembert-base-sentiment	95.56	61.01	88.80
nlptown/bert-base-multilingual-uncased-sentiment	187.70	54.41	82.82

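tf-allociné and barthez-sentiment-classification

tblard/tf-allocine, based on CamemBERT, and moussaKam/barthez-sentiment-classification, based on BARThez, both use a binary classification target. To recast our task as a two-class problem, we treat the "1 star" and "2 stars" labels as negative sentiment and the "4 stars" and "5 stars" labels as positive sentiment. We exclude "3 stars", which can be interpreted as a neutral class. In this setting the +/-1 star estimation error problem disappears, so we use only the standard accuracy definition.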
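
The label mapping described above can be sketched as follows. The mapping of true labels follows the text; how a five-class prediction is turned into a binary one is not specified, so comparing the summed scores of the negative and positive classes is an assumption of this sketch, as are the helper names `to_binary_target` and `to_binary_prediction`.

```python
from typing import Mapping, Optional

NEGATIVE_LABELS = ("1 star", "2 stars")
POSITIVE_LABELS = ("4 stars", "5 stars")


def to_binary_target(label: str) -> Optional[str]:
    """Map a star label to a binary sentiment; "3 stars" is excluded (None)."""
    if label in NEGATIVE_LABELS:
        return "negative"
    if label in POSITIVE_LABELS:
        return "positive"
    return None


def to_binary_prediction(scores: Mapping[str, float]) -> str:
    """Assumed rule: compare total negative mass with total positive mass."""
    negative = sum(scores[label] for label in NEGATIVE_LABELS)
    positive = sum(scores[label] for label in POSITIVE_LABELS)
    return "positive" if positive >= negative else "negative"


# Per-class scores for one review (rounded, illustrative values).
example_scores = {
    "1 star": 0.0475,
    "2 stars": 0.1415,
    "3 stars": 0.3586,
    "4 stars": 0.3181,
    "5 stars": 0.1342,
}
binary_label = to_binary_prediction(example_scores)
```

Reviews whose true label maps to `None` would simply be dropped before computing the accuracy.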

model	time (ms)	exact accuracy (%)
cmarkea/distilcamembert-base-sentiment	95.56	97.52
tblard/tf-allocine	329.74	95.69
moussaKam/barthez-sentiment-classification	197.95	94.29

How to use DistilCamemBERT-Sentiment

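from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
analyzer = pipeline(
    task='text-classification',
    model="cmarkea/distilcamembert-base-sentiment",
    tokenizer="cmarkea/distilcamembert-base-sentiment"
)
# return_all_scores=True returns the score of every class, not only the best one.
result = analyzer(
    "J'aime me promener en forêt même si ça me donne mal aux pieds.",
    return_all_scores=True
)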

result
[{'label': '1 star',
  'score': 0.047529436647892},
 {'label': '2 stars',
  'score': 0.14150355756282806},
 {'label': '3 stars',
  'score': 0.3586442470550537},
 {'label': '4 stars',
  'score': 0.3181498646736145},
 {'label': '5 stars',
  'score': 0.13417290151119232}]
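
As a small follow-up, the per-class scores shown above can be reduced to a single predicted label and to the two best candidates used by the top-2 metric. The snippet below is a sketch that copies the scores from the output above into a literal, so it runs without downloading the model; the variable names are illustrative.

```python
# Scores copied from the example output above.
scores = [
    {"label": "1 star", "score": 0.047529436647892},
    {"label": "2 stars", "score": 0.14150355756282806},
    {"label": "3 stars", "score": 0.3586442470550537},
    {"label": "4 stars", "score": 0.3181498646736145},
    {"label": "5 stars", "score": 0.13417290151119232},
]

# Rank classes by decreasing score.
ranked = sorted(scores, key=lambda item: item["score"], reverse=True)

best_label = ranked[0]["label"]                       # single predicted class
top2_labels = [item["label"] for item in ranked[:2]]  # candidates for top-2 accuracy
```

For this sentence the two best classes are adjacent star ratings, which is the kind of near miss the top-2 accuracy is meant to tolerate.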

Optimum + ONNX

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-sentiment"

# Load the tokenizer and the ONNX Runtime version of the model, then wrap them in a pipeline.
tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)

Citation

@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}

Author:

Credit Mutuel Arkea

Dataset size:

780.35 MB