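Camembert-base model fine-tuned on the French part of the XNLI dataset. It is one of the few zero-shot classification models that work on French.

Two different usages: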

As a zero-shot sequence classifier:

    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="BaptisteDoyen/camembert-base-xnli")

    sequence = "L'équipe de France joue aujourd'hui au Parc des Princes"
    candidate_labels = ["sport", "politique", "science"]
    hypothesis_template = "Ce texte parle de {}."

    classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)
    # outputs:
    # {'sequence': "L'équipe de France joue aujourd'hui au Parc des Princes",
    #  'labels': ['sport', 'politique', 'science'],
    #  'scores': [0.8595073223114014, 0.10821866989135742, 0.0322740375995636]}
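
To make the role of `hypothesis_template` more concrete, here is a minimal sketch of what happens around the model call: each candidate label is inserted into the template to form one NLI hypothesis, and the per-label entailment logits are then normalized across labels. It assumes the pipeline's default single-label mode; the helper names and the logit values are hypothetical and are not real model output.

```python
import math


def build_hypotheses(candidate_labels, hypothesis_template):
    """Turn each candidate label into one NLI hypothesis string."""
    return [hypothesis_template.format(label) for label in candidate_labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


candidate_labels = ["sport", "politique", "science"]
hypotheses = build_hypotheses(candidate_labels, "Ce texte parle de {}.")

# Hypothetical entailment logits, one per hypothesis (NOT real model output).
entailment_logits = [2.0, 0.0, -1.0]
scores = softmax(entailment_logits)

# Labels sorted from most to least likely, as in the pipeline output.
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalized across labels, they sum to 1 and are only meaningful relative to the other candidate labels supplied in the same call.
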

As a premise/hypothesis checker, i.e. a standard NLI model:

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # load model and tokenizer
    nli_model = AutoModelForSequenceClassification.from_pretrained("BaptisteDoyen/camembert-base-xnli")
    tokenizer = AutoTokenizer.from_pretrained("BaptisteDoyen/camembert-base-xnli")

    # sequences
    premise = "le score pour les bleus est élevé"
    hypothesis = "L'équipe de France a fait un bon match"

    # tokenize and run through model
    x = tokenizer.encode(premise, hypothesis, return_tensors='pt')
    logits = nli_model(x)[0]

    # we throw away "neutral" (dim 1) and take the probability of
    # "entailment" (0) as the probability of the label being true
    entail_contradiction_logits = logits[:, ::2]
    probs = entail_contradiction_logits.softmax(dim=1)
    prob_label_is_true = probs[:, 0]
    prob_label_is_true[0].tolist() * 100
    # outputs
    # 86.40775084495544
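
The "drop neutral, then softmax" step can be illustrated without loading the model. The sketch below assumes the logit order `[entailment, neutral, contradiction]` described in the comments above; the function name and the example logits are hypothetical.

```python
import math


def entailment_probability(logits):
    """Probability of entailment after discarding the neutral logit.

    `logits` is assumed to be ordered [entailment, neutral, contradiction].
    """
    entail, _neutral, contra = logits
    # Softmax over the two remaining logits, shifted for numerical stability.
    highest = max(entail, contra)
    exp_entail = math.exp(entail - highest)
    exp_contra = math.exp(contra - highest)
    return exp_entail / (exp_entail + exp_contra)


# Hypothetical logits for one premise/hypothesis pair (NOT real model output).
example_logits = [1.5, 0.3, -0.35]
prob = entailment_probability(example_logits)
print(f"{prob * 100:.2f}%")
```

With only two logits left, this softmax is equivalent to a sigmoid of the difference between the entailment and contradiction logits, so the neutral logit has no influence on the result.
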

The training data is the French part of the XNLI dataset released by Facebook in 2018. It can be retrieved easily with the datasets library:

    from datasets import load_dataset

    dataset = load_dataset('xnli', 'fr')
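
As a quick sanity check on the loaded data, one might count how often each class occurs. The sketch below assumes each record exposes `premise`, `hypothesis` and an integer `label` field with ids ordered entailment, neutral, contradiction; the toy records are made up so that the snippet runs without downloading anything.

```python
from collections import Counter

# Assumed mapping from integer label ids to class names.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def label_distribution(examples):
    """Count how many examples fall into each NLI class."""
    counts = Counter(LABEL_NAMES[example["label"]] for example in examples)
    return dict(counts)


# Made-up records mimicking the assumed field layout (NOT taken from XNLI).
toy_examples = [
    {"premise": "Il pleut depuis ce matin.", "hypothesis": "Le temps est humide.", "label": 0},
    {"premise": "Il pleut depuis ce matin.", "hypothesis": "Le ciel est dégagé.", "label": 2},
    {"premise": "Elle lit un roman.", "hypothesis": "Elle lit un livre.", "label": 0},
]

print(label_distribution(toy_examples))
```

With the real data, the same helper could be applied to a split, for example `label_distribution(dataset["validation"])`.
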
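
The training procedure was basic and ran in the cloud on a single GPU. Main training parameters:

We obtain the following results on the validation and test sets:
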
Set | Accuracy (%) |
---|---|
validation | 81.4 |
test | 81.7 |
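
For reference, the sketch below shows how an accuracy figure of this kind is typically computed. It assumes accuracy means the share of exactly matching labels, expressed in percent; the source does not state how the numbers above were obtained, and the toy predictions are made up.

```python
def accuracy_percent(predictions, references):
    """Share of predictions equal to the reference labels, in percent."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up label ids for five examples (NOT real model predictions).
toy_predictions = [0, 2, 1, 0, 2]
toy_references = [0, 2, 2, 0, 1]

print(accuracy_percent(toy_predictions, toy_references))  # 3 of 5 correct -> 60.0
```
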