Model:
sentence-transformers/multi-qa-distilbert-cos-v1
Task:
Sentence similarity
Datasets:
flax-sentence-embeddings/stackexchange_xml ms_marco gooaq yahoo_answers_topics search_qa eli5 natural_questions trivia_qa embedding-data/QQP embedding-data/PAQ_pairs embedding-data/Amazon-QA embedding-data/WikiAnswers

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings

# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
The following table lists some technical details on how this model should be used:
Setting | Value |
---|---|
Dimensions | 768 |
Produces normalized embeddings | Yes |
Pooling-Method | Mean pooling |
Suitable score functions | dot-product (util.dot_score), cosine-similarity (util.cos_sim), or euclidean distance |
Note: When loaded with sentence-transformers, this model produces normalized embeddings with length 1. In that case, dot-product and cosine-similarity are equivalent; dot-product is preferred as it is faster. Euclidean distance can also be used: for unit-length vectors, squared Euclidean distance equals 2 − 2 × dot-product, so it yields the same ranking.
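The relationship between the three score functions on unit-length vectors can be checked with a small sketch. It uses only the Python standard library and toy 3-dimensional vectors rather than real model output; the helper names (`normalize`, `dot`, `cos_sim`, `sq_euclidean`) are illustrative and are not part of sentence-transformers.

```python
import math

def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors standing in for real 768-dimensional embeddings (assumption).
query = normalize([1.0, 2.0, 3.0])
doc = normalize([2.0, 0.5, 1.0])

d = dot(query, doc)
# On unit vectors the dot product and the cosine similarity coincide.
print(math.isclose(d, cos_sim(query, doc)))
# Squared Euclidean distance is 2 - 2 * dot, so it ranks documents in the same order.
print(math.isclose(sq_euclidean(query, doc), 2 - 2 * d))
```

Because the distance decreases as the dot product increases, sorting by ascending distance should give the same order as sorting by descending dot score.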
The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project: Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as help from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.
Our model is intended to be used for semantic search: it encodes queries/questions and text paragraphs in a dense vector space, so that relevant documents can be found for a given query.
Note that there is a limit of 512 word pieces: text longer than that will be truncated. Also note that the model was only trained on input text of up to 250 word pieces, so it might not work well for longer text.
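One possible way to handle longer documents is to split them into overlapping windows and embed each window separately. The sketch below is an assumption, not something the model card prescribes: `chunk_tokens` is a hypothetical helper, the overlap of 50 is an arbitrary illustrative value, and whitespace-separated words are only a rough stand-in for word pieces (a real tokenizer usually produces more pieces than words).

```python
def chunk_tokens(tokens, max_len=250, overlap=50):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace tokens approximate word pieces here (assumption).
long_text = " ".join(f"word{i}" for i in range(600))
chunks = chunk_tokens(long_text.split())
print([len(c) for c in chunks])
```

Each window could then be joined back into a string and encoded on its own, for example keeping the best-scoring window per document.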
The full training script is available in this repository: train_script.py.
We use the pretrained distilbert-base-uncased model. Please refer to the model card for more detailed information about the pre-training procedure.
Training

We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs. We sampled each dataset with a weighted probability; the configuration is detailed in the data_config.json file.
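Weighted sampling across datasets can be sketched as follows. The dataset names and weights are hypothetical placeholders (the real values are in data_config.json and are not reproduced here), and picking one dataset per batch is an assumption about the granularity, not a description of the actual training script.

```python
import random

# Hypothetical names and weights, for illustration only.
weights = {"dataset_a": 5.0, "dataset_b": 3.0, "dataset_c": 1.0}

def sample_sources(weights, num_batches, seed=0):
    """Pick the dataset each batch is drawn from, in proportion to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_batches)

schedule = sample_sources(weights, num_batches=9000)
counts = {name: schedule.count(name) for name in weights}
print(counts)  # counts come out roughly in a 5:3:1 ratio
```

With weights like these, a large dataset does not have to dominate training in proportion to its raw size, which is the usual reason for sampling by weight instead of concatenating and shuffling.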
The model was trained with MultipleNegativesRankingLoss using Mean-pooling, cosine-similarity as similarity function, and a scale of 20.
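The idea behind this loss can be illustrated with a minimal standard-library sketch. It is not the sentence-transformers implementation: `in_batch_ranking_loss` is a hypothetical name, and the 2-dimensional embeddings are made up. For each question, its paired answer is treated as the correct class and all other answers in the batch as negatives, with cross-entropy computed over cosine similarities multiplied by the scale.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_ranking_loss(questions, answers, scale=20.0):
    """Mean cross-entropy where answers[i] is the positive for questions[i]
    and every other answer in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(questions):
        logits = [scale * cos_sim(q, a) for a in answers]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(questions)

# Toy embeddings (made up for illustration).
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each answer is close to its own question
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each answer is close to the other question

print(in_batch_ranking_loss(questions, aligned))  # near zero
print(in_batch_ranking_loss(questions, swapped))  # much larger
```

The scale sharpens the softmax: cosine similarities lie in [-1, 1], so without scaling the difference between a positive and a negative would contribute only a small gradient signal.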
Dataset | Description | Number of training tuples |
---|---|---|
WikiAnswers | Duplicate question pairs from WikiAnswers | 77,427,422 |
PAQ | Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
Stack Exchange | (Title, Body) pairs from all StackExchanges | 25,316,456 |
Stack Exchange | (Title, Answer) pairs from all StackExchanges | 21,396,559 |
MS MARCO | Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
GOOAQ: Open Question Answering with Diverse Answer Types | (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
Amazon-QA | (Question, Answer) pairs from Amazon product pages | 2,448,839 |
Yahoo Answers | (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
Yahoo Answers | (Question, Answer) pairs from Yahoo Answers | 681,164 |
Yahoo Answers | (Title, Question) pairs from Yahoo Answers | 659,896 |
SearchQA | (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
ELI5 | (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
Stack Exchange | Duplicate questions pairs (titles) | 304,525 |
Quora Question Triplets | (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
Natural Questions (NQ) | (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
SQuAD2.0 | (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
TriviaQA | (Question, Evidence) pairs | 73,346 |
Total | | 214,988,242 |