Model:
facebook/contriever-msmarco
This model is a fine-tuned version of the pre-trained contriever model available at https://huggingface.co/facebook/contriever, following the approach described in Towards Unsupervised Dense Information Retrieval with Contrastive Learning. The associated GitHub repository is available at https://github.com/facebookresearch/contriever.
Using the model directly from HuggingFace transformers requires adding a mean pooling operation over the token embeddings to obtain a sentence embedding.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('facebook/contriever-msmarco')
model = AutoModel.from_pretrained('facebook/contriever-msmarco')

sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Mean pooling: average the token embeddings, ignoring padded positions
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

# Apply tokenizer
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
outputs = model(**inputs)

embeddings = mean_pooling(outputs[0], inputs['attention_mask'])
```
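The resulting embeddings can be compared with a dot product to score passages against a query, which is how Contriever-style dense retrieval is typically used. Below is a minimal sketch under that assumption; it reuses the `sentences` and `embeddings` variables from the snippet above (treating the first sentence as the query and the rest as passages), and the ranking loop itself is illustrative rather than part of the official example.

```python
# Treat embeddings[0] as the query and embeddings[1:] as candidate passages
query_embedding = embeddings[0]
passage_embeddings = embeddings[1:]

# Dot-product similarity between the query and each passage; shape: (num_passages,)
scores = passage_embeddings @ query_embedding

# Rank passages from most to least similar to the query
for rank, idx in enumerate(scores.argsort(descending=True).tolist()):
    print(f"{rank + 1}. score={scores[idx].item():.3f}  {sentences[idx + 1]}")
```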