Model:
sentence-transformers/multi-qa-MiniLM-L6-cos-v1
Task:
Sentence Similarity
Datasets:
flax-sentence-embeddings/stackexchange_xml ms_marco gooaq yahoo_answers_topics search_qa eli5 natural_questions trivia_qa embedding-data/QQP embedding-data/PAQ_pairs embedding-data/Amazon-QA embedding-data/WikiAnswers

This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for semantic search. It was trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, see: SBERT.net - Semantic Search
Using this model is easy once you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:

    from sentence_transformers import SentenceTransformer, util

    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Load the model
    model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

    # Encode query and documents
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)

    # Compute dot score between query and all document embeddings
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

    # Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))

    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    # Output passages & scores
    for doc, score in doc_score_pairs:
        print(score, doc)

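Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.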

    from transformers import AutoTokenizer, AutoModel
    import torch
    import torch.nn.functional as F

    # Mean Pooling - Take average of all tokens
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Encode text
    def encode(texts):
        # Tokenize sentences
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

        # Compute token embeddings
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)

        # Perform pooling
        embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

        # Normalize embeddings
        embeddings = F.normalize(embeddings, p=2, dim=1)

        return embeddings

    # Sentences we want sentence embeddings for
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
    model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

    # Encode query and docs
    query_emb = encode(query)
    doc_emb = encode(docs)

    # Compute dot score between query and all document embeddings
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

    # Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))

    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    # Output passages & scores
    for doc, score in doc_score_pairs:
        print(score, doc)

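As in the PyTorch example above, to use the model with TensorFlow you pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.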

    from transformers import AutoTokenizer, TFAutoModel
    import tensorflow as tf

    # Mean Pooling - Take attention mask into account for correct averaging
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output.last_hidden_state
        input_mask_expanded = tf.cast(tf.tile(tf.expand_dims(attention_mask, -1), [1, 1, token_embeddings.shape[-1]]), tf.float32)
        return tf.math.reduce_sum(token_embeddings * input_mask_expanded, 1) / tf.math.maximum(tf.math.reduce_sum(input_mask_expanded, 1), 1e-9)

    # Encode text
    def encode(texts):
        # Tokenize sentences
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')

        # Compute token embeddings
        model_output = model(**encoded_input, return_dict=True)

        # Perform pooling
        embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

        # Normalize embeddings
        embeddings = tf.math.l2_normalize(embeddings, axis=1)

        return embeddings

    # Sentences we want sentence embeddings for
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
    model = TFAutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

    # Encode query and docs
    query_emb = encode(query)
    doc_emb = encode(docs)

    # Compute dot score between query and all document embeddings
    scores = (query_emb @ tf.transpose(doc_emb))[0].numpy().tolist()

    # Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))

    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    # Output passages & scores
    for doc, score in doc_score_pairs:
        print(score, doc)

The following are some technical details on how this model should be used:
Setting | Value |
---|---|
Dimensions | 384 |
Produces normalized embeddings | Yes |
Pooling-Method | Mean pooling |
Suitable score functions | dot-product ( util.dot_score ), cosine-similarity ( util.cos_sim ), or euclidean distance |
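Note: When loaded with sentence-transformers, this model produces normalized embeddings of length 1. In that case, dot product and cosine similarity are equivalent. Dot product is preferred because it is faster to compute. Euclidean distance can also be used: for unit-length vectors, the squared Euclidean distance equals 2 - 2 × (dot product), so it produces the same ranking.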
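The relationship between the three score functions on unit-length vectors can be checked numerically. The following is a minimal pure-Python sketch; the helper names and the toy 3-dimensional vectors are hypothetical stand-ins for real 384-dimensional embeddings, and nothing here calls the model itself.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors (assumed values), normalized the way the model's output is.
query = l2_normalize([1.0, 2.0, 3.0])
docs = [l2_normalize(v) for v in ([1.0, 2.0, 2.5], [3.0, -1.0, 0.5], [0.0, 1.0, 1.0])]

for d in docs:
    # On unit vectors, dot product equals cosine similarity ...
    assert math.isclose(dot(query, d), cosine(query, d), abs_tol=1e-12)
    # ... and squared Euclidean distance equals 2 - 2 * dot product.
    assert math.isclose(squared_euclidean(query, d), 2 - 2 * dot(query, d), abs_tol=1e-12)

# Ranking by descending dot product matches ranking by ascending distance.
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_dist = sorted(range(len(docs)), key=lambda i: squared_euclidean(query, docs[i]))
print(by_dot == by_dist)
```

Because the three functions agree on ranking, the choice between them mainly comes down to speed and to what a given vector index supports.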
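The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given one sentence from a pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.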
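The objective described above can be sketched as a softmax over similarity scores within a batch, where each question's true answer competes against the answers of the other questions. This is an illustrative pure-Python version, not the project's training code: the function name and toy vectors are hypothetical, and the cosine similarity with a scale of 20 is an assumption that mirrors the training settings reported for this model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_contrastive_loss(questions, answers, scale=20.0):
    """Mean cross-entropy where answers[i] is the positive for questions[i]
    and every other answer in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(questions):
        logits = [scale * cosine(q, a) for a in answers]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional embeddings (hypothetical values).
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each answer points toward its own question
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each answer points toward the other question

loss_aligned = in_batch_contrastive_loss(questions, aligned)
loss_swapped = in_batch_contrastive_loss(questions, swapped)
print(loss_aligned, loss_swapped)
```

The loss is close to zero when every question is most similar to its own answer and grows large when the pairing is wrong, which is the signal that pulls matching pairs together during training.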
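We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8s), as well as help from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.
Our model is intended for semantic search: it encodes queries/questions and text paragraphs into a dense vector space, so that documents relevant to a given query can be found.
Note that there is a limit of 512 word pieces: text longer than that is truncated. Note also that the model was only trained on input text of up to 250 word pieces, so it might not work well on longer text.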
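The two length thresholds above can be expressed as a simple check. This is a rough sketch under stated assumptions: the helper name is hypothetical, the inputs are placeholder lists rather than real tokenizer output, and whether special tokens count toward the limit is not addressed here.

```python
MAX_WORD_PIECES = 512      # hard limit stated above: anything longer is truncated
TRAINED_WORD_PIECES = 250  # longest inputs seen during training, per the note above

def check_length(word_pieces):
    """Return the pieces the model would actually see, plus a status label."""
    kept = word_pieces[:MAX_WORD_PIECES]
    if len(word_pieces) > MAX_WORD_PIECES:
        status = "truncated"
    elif len(word_pieces) > TRAINED_WORD_PIECES:
        status = "longer than training inputs"
    else:
        status = "ok"
    return kept, status

# Hypothetical inputs: lists of placeholder pieces, not real word pieces.
for n in (100, 300, 600):
    kept, status = check_length(["tok"] * n)
    print(n, len(kept), status)
```

In practice the word pieces would come from the model's own tokenizer, since word-piece counts are usually higher than whitespace-separated word counts.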
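The full training script is available in this repository: train_script.py.
We use the pretrained nreimers/MiniLM-L6-H384-uncased model. Please refer to its model card for more details about the pre-training procedure.
Training
We fine-tune our model on the concatenation of multiple datasets, about 215M (question, answer) pairs in total. Each dataset is sampled with a weighted probability; the weights are detailed in the data_config.json file.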
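Weighted sampling across datasets can be sketched as follows. The dataset names and weights are hypothetical, since the schema and values of data_config.json are not shown here, and the function is an illustration rather than the logic in train_script.py.

```python
import random

# Hypothetical weights; the real values live in data_config.json.
dataset_weights = {
    "dataset_a": 5,
    "dataset_b": 3,
    "dataset_c": 1,
}

def sample_dataset_names(weights, num_draws, seed=0):
    """Pick a source dataset for each draw with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_draws)

picks = sample_dataset_names(dataset_weights, num_draws=10_000)
for name in dataset_weights:
    # The observed share of each dataset approaches weight / sum(weights).
    print(name, picks.count(name) / len(picks))
```

Weighting of this kind lets smaller datasets contribute without being drowned out by the largest ones, and keeps the largest ones from dominating every batch.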
The model was trained with MultipleNegativesRankingLoss, using mean pooling, cosine similarity as the similarity function, and a scale of 20.
Dataset | Number of training tuples |
---|---|
Duplicate question pairs from WikiAnswers | 77,427,422 |
Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
(Title, Body) pairs from all StackExchanges | 25,316,456 |
(Title, Answer) pairs from all StackExchanges | 21,396,559 |
Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
(query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
(Question, Answer) pairs from Amazon product pages | 2,448,839 |
(Title, Answer) pairs from Yahoo Answers | 1,198,260 |
(Question, Answer) pairs from Yahoo Answers | 681,164 |
(Title, Question) pairs from Yahoo Answers | 659,896 |
(Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
(Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
Duplicate questions pairs (titles) | 304,525 |
(Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
(Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
(Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
(Question, Evidence) pairs | 73,346 |
Total | 214,988,242 |