Model:

BaptisteDoyen/camembert-base-xnli

camembert-base-xnli

Model description

Camembert-base model fine-tuned on the French part of the XNLI dataset. It is one of the few zero-shot classification models that work on French.

Intended uses & limitations

How to use

Two different usages:

  • As a zero-shot sequence classifier:
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="BaptisteDoyen/camembert-base-xnli")

sequence = "L'équipe de France joue aujourd'hui au Parc des Princes"
candidate_labels = ["sport","politique","science"]
hypothesis_template = "Ce texte parle de {}."    

classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)     
# outputs :                                        
# {'sequence': "L'équipe de France joue aujourd'hui au Parc des Princes",
# 'labels': ['sport', 'politique', 'science'],
# 'scores': [0.8595073223114014, 0.10821866989135742, 0.0322740375995636]}                      
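  • As a premise/hypothesis checker: the idea here is to compute a probability of the form P(premise|hypothesis), i.e. how likely it is that the premise entails the hypothesis
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# load model and tokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained("BaptisteDoyen/camembert-base-xnli")
tokenizer = AutoTokenizer.from_pretrained("BaptisteDoyen/camembert-base-xnli")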
# sequences
premise = "le score pour les bleus est élevé"
hypothesis = "L'équipe de France a fait un bon match"
# tokenize and run through model
x = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = nli_model(x)[0]
# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (0) as the probability of the label being true 
entail_contradiction_logits = logits[:,::2]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,0]
prob_label_is_true[0].tolist() * 100
# outputs
# 86.40775084495544
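
Both snippets above rest on the same arithmetic over the three NLI logits. As a rough illustration, the sketch below reproduces that arithmetic in plain Python on made-up numbers. `entailment_probability` mirrors the second snippet (drop "neutral", softmax over entailment and contradiction), while `rank_labels` mirrors what the zero-shot pipeline is understood to do in its default single-label mode (softmax over the entailment logits of all candidate labels). The helper names and the logit values are hypothetical, and the label order (0 = entailment, 1 = neutral, 2 = contradiction) is taken from the comments in the snippet above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entailment_probability(logits, entailment_id=0, contradiction_id=2):
    """Drop "neutral" and normalise over (entailment, contradiction)."""
    return softmax([logits[entailment_id], logits[contradiction_id]])[0]


def rank_labels(logits_per_label, entailment_id=0):
    """Softmax over the entailment logits of all candidate labels, best first."""
    labels = list(logits_per_label)
    scores = softmax([logits_per_label[label][entailment_id] for label in labels])
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits, one [entailment, neutral, contradiction] triple per
# candidate label. These numbers are invented, not produced by the model.
example_logits = {
    "sport": [2.0, 0.1, -1.5],
    "politique": [0.0, 0.4, 0.3],
    "science": [-1.0, 0.2, 1.1],
}

ranking = rank_labels(example_logits)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
print(f"P(entailment) for 'sport': {entailment_probability(example_logits['sport']):.3f}")
```

The two functions normalise over different things: `rank_labels` makes the candidate labels compete with each other, whereas `entailment_probability` scores a single premise/hypothesis pair on its own.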

Training data

The training data is the French fold of the XNLI dataset released by Facebook in 2018. It is easily available through the datasets library:

from datasets import load_dataset
dataset = load_dataset('xnli', 'fr')                     
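
To make the record layout concrete, here is a small sketch of how one example might be read. It assumes each record of the 'fr' configuration exposes `premise`, `hypothesis` and an integer `label`, with label ids ordered entailment, neutral, contradiction (the same order the inference snippet above relies on). The record below and the helper `describe_example` are hypothetical, written by hand rather than copied from the dataset.

```python
# Assumed label order for XNLI: ids 0, 1, 2.
XNLI_LABELS = ["entailment", "neutral", "contradiction"]


def describe_example(example):
    """Render one premise/hypothesis pair together with its label name."""
    label_name = XNLI_LABELS[example["label"]]
    return f"{example['premise']} -> {example['hypothesis']} [{label_name}]"


# Hand-written record in the assumed layout (not taken from XNLI).
sample = {
    "premise": "le score pour les bleus est élevé",
    "hypothesis": "L'équipe de France a fait un bon match",
    "label": 0,
}
print(describe_example(sample))
```

With the real data, the same helper should be callable as `describe_example(dataset["train"][0])`, provided the layout assumption holds.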

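Training/Fine-Tuning procedure

The training procedure is pretty basic and was performed in the cloud on a single GPU. Main training parameters: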

  • lr = 2e-5, lr_scheduler_type = "linear"
  • num_train_epochs = 4
  • batch_size = 12 (limited by GPU memory)
  • weight_decay = 0.01
  • metric_for_best_model = "eval_accuracy"
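
The parameter list above maps fairly directly onto keyword names used by `transformers.TrainingArguments`. The sketch below writes that mapping as a plain dict and adds a small worked calculation of the step count and of a linear learning-rate decay. Several things here are assumptions rather than details from the model card: the `output_dir` value, the mapping of `lr` and `batch_size` to `learning_rate` and `per_device_train_batch_size`, no warmup and no gradient accumulation, and a train split of 392,702 examples.

```python
import math

# Values from the list above; key names follow transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "camembert-base-xnli-out",  # hypothetical path
    "learning_rate": 2e-5,                    # "lr" in the list above
    "lr_scheduler_type": "linear",
    "num_train_epochs": 4,
    "per_device_train_batch_size": 12,        # "batch_size" in the list above
    "weight_decay": 0.01,
    "metric_for_best_model": "eval_accuracy",
}


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one GPU without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed size of the XNLI train split; adjust if the actual split differs.
ASSUMED_TRAIN_EXAMPLES = 392_702

steps = total_training_steps(
    ASSUMED_TRAIN_EXAMPLES,
    training_kwargs["per_device_train_batch_size"],
    training_kwargs["num_train_epochs"],
)
print("total steps:", steps)
for fraction in (0.0, 0.5, 1.0):
    step = int(steps * fraction)
    print(step, linear_lr(step, steps, training_kwargs["learning_rate"]))
```

If transformers is installed, the dict could in principle be passed as `TrainingArguments(**training_kwargs)`; selecting the best checkpoint by `eval_accuracy` would additionally need evaluation to be enabled, which the list above does not detail.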

Eval results

We obtain the following results on the validation and test sets:

Set         Accuracy (%)
validation  81.4
test        81.7
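
For reference, accuracy here is taken to mean the share of examples whose highest-scoring class matches the gold label. Below is a minimal sketch, assuming a hypothetical `compute_metrics`-style helper that works on plain Python lists (a Trainer callback would receive arrays instead). The `eval_` prefix in `eval_accuracy` is added by the Trainer to the keys such a function returns. The logits and labels are invented for illustration.

```python
def accuracy_percent(logits, labels):
    """Percentage of rows whose argmax equals the gold label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == gold) for pred, gold in zip(predictions, labels))
    return {"accuracy": 100.0 * correct / len(labels)}


# Invented [entailment, neutral, contradiction] logits and gold labels.
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [0.1, 0.2, 0.9],
    [1.2, 0.3, 0.4],
]
toy_labels = [0, 1, 2, 1]

metrics = accuracy_percent(toy_logits, toy_labels)
print(metrics)
```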