Model:
cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
The model was trained on the MMARCO dataset, a machine-translated version of MS MARCO created with Google Translate and covering 14 languages. In our experiments, we observed that the model also performs well on other languages.
As the base model, we used the multilingual MiniLMv2 model.
The model can be used for information retrieval: given a query, encode the query together with all candidate passages (e.g., passages retrieved with ElasticSearch), then sort the passages in descending order of score. For more details, see SBERT.net Retrieve & Re-rank. The training code is available here: SBERT.net Training MS Marco.
Using the pre-trained model is easy when you have SentenceTransformers installed:
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
scores = model.predict([('Query', 'Paragraph1'),
                        ('Query', 'Paragraph2'),
                        ('Query', 'Paragraph3')])
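The following is a minimal sketch tying this together with the re-ranking flow described above: score every (query, passage) pair, then sort the passages by descending score. The query and passages here are made-up examples, and the retrieval step (e.g., with ElasticSearch) is assumed to have already happened.

from sentence_transformers import CrossEncoder

# Load the cross-encoder (same model as above).
model = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

# Hypothetical query and candidate passages, e.g. as returned by a retriever.
query = 'How many people live in Berlin?'
passages = [
    'Berlin has a population of 3,520,031 registered inhabitants.',
    'New York City is famous for the Metropolitan Museum of Art.',
    'Berlin is the capital of Germany.',
]

# Score each (query, passage) pair, then sort passages by score, descending.
scores = model.predict([(query, p) for p in passages])
ranked = sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)
for score, passage in ranked:
    print(f'{score:.4f}\t{passage}')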
You can also use the model directly with the Transformers library (without SentenceTransformers):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
     'New York City is famous for the Metropolitan Museum of Art.'],
    padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
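Each row of scores is the relevance logit for the corresponding (query, passage) pair, and higher means more relevant. Assuming the classification head outputs a single logit per pair (as is typical for these cross-encoders), the first pair, which actually answers the Berlin question, should score noticeably higher than the second.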