模型:
facebook/spar-wiki-bm25-lexmodel-query-encoder
This model is the query encoder of the Wiki BM25 Lexical Model (Λ) from the SPAR paper:
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One? Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta and Wen-tau Yih Meta AI
The associated github repo is available here: https://github.com/facebookresearch/dpr-scale/tree/main/spar
This model is a BERT-base sized dense retriever trained on Wikipedia articles to imitate the behavior of BM25. The following models are also available:
Pretrained Model | Corpus | Teacher | Architecture | Query Encoder Path | Context Encoder Path |
---|---|---|---|---|---|
Wiki BM25 Λ | Wikipedia | BM25 | BERT-base | facebook/spar-wiki-bm25-lexmodel-query-encoder | facebook/spar-wiki-bm25-lexmodel-context-encoder |
PAQ BM25 Λ | PAQ | BM25 | BERT-base | facebook/spar-paq-bm25-lexmodel-query-encoder | facebook/spar-paq-bm25-lexmodel-context-encoder |
MARCO BM25 Λ | MS MARCO | BM25 | BERT-base | facebook/spar-marco-bm25-lexmodel-query-encoder | facebook/spar-marco-bm25-lexmodel-context-encoder |
MARCO UniCOIL Λ | MS MARCO | UniCOIL | BERT-base | facebook/spar-marco-unicoil-lexmodel-query-encoder | facebook/spar-marco-unicoil-lexmodel-context-encoder |
This model should be used together with the associated context encoder, similar to the DPR model.
import torch from transformers import AutoTokenizer, AutoModel # The tokenizer is the same for the query and context encoder tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder') query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder') context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder') query = "Where was Marie Curie born?" contexts = [ "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.", "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace." ] # Apply tokenizer query_input = tokenizer(query, return_tensors='pt') ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt') # Compute embeddings: take the last-layer hidden state of the [CLS] token query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :] ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :] # Compute similarity scores using dot product score1 = query_emb @ ctx_emb[0] # 341.3268 score2 = query_emb @ ctx_emb[1] # 340.1626
As Λ learns lexical matching from a sparse teacher retriever, it can be used in combination with a standard dense retriever (e.g. DPR , Contriever ) to build a dense retriever that excels at both lexical and semantic matching.
In the following example, we show how to build the SPAR-Wiki model for Open-Domain Question Answering by concatenating the embeddings of DPR and the Wiki BM25 Λ.
import torch from transformers import AutoTokenizer, AutoModel from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer from transformers import DPRContextEncoder, DPRContextEncoderTokenizer # DPR model dpr_ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base") dpr_ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base") dpr_query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base") dpr_query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base") # Wiki BM25 Λ model lexmodel_tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder') lexmodel_query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder') lexmodel_context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder') query = "Where was Marie Curie born?" contexts = [ "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.", "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace." ] # Compute DPR embeddings dpr_query_input = dpr_query_tokenizer(query, return_tensors='pt')['input_ids'] dpr_query_emb = dpr_query_encoder(dpr_query_input).pooler_output dpr_ctx_input = dpr_ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt') dpr_ctx_emb = dpr_ctx_encoder(**dpr_ctx_input).pooler_output # Compute Λ embeddings lexmodel_query_input = lexmodel_tokenizer(query, return_tensors='pt') lexmodel_query_emb = lexmodel_query_encoder(**query_input).last_hidden_state[:, 0, :] lexmodel_ctx_input = lexmodel_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt') lexmodel_ctx_emb = lexmodel_context_encoder(**ctx_input).last_hidden_state[:, 0, :] # Form SPAR embeddings via concatenation # The concatenation weight is only applied to query embeddings # Refer to the SPAR paper for details concat_weight = 0.7 spar_query_emb = torch.cat( [dpr_query_emb, concat_weight * lexmodel_query_emb], dim=-1, ) spar_ctx_emb = torch.cat( [dpr_ctx_emb, lexmodel_ctx_emb], dim=-1, ) # Compute similarity scores score1 = spar_query_emb @ spar_ctx_emb[0] # 317.6931 score2 = spar_query_emb @ spar_ctx_emb[1] # 314.6144