multi-qa-MiniLM-L6-cos-v1

This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

PyTorch Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, you pass your input through the transformer model, then you have to apply the correct pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take average of all tokens, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

TensorFlow Usage (HuggingFace Transformers)

Similarly to the PyTorch example above, when using the model with TensorFlow you pass your input through the transformer model and then apply the correct pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = tf.cast(tf.tile(tf.expand_dims(attention_mask, -1), [1, 1, token_embeddings.shape[-1]]), tf.float32)
    return tf.math.reduce_sum(token_embeddings * input_mask_expanded, 1) / tf.math.maximum(tf.math.reduce_sum(input_mask_expanded, 1), 1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')

    # Compute token embeddings
    model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = tf.math.l2_normalize(embeddings, axis=1)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model = TFAutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = (query_emb @ tf.transpose(doc_emb))[0].numpy().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Technical Details

In the following, some technical details on how this model must be used:

Setting | Value
Dimensions | 384
Produces normalized embeddings | Yes
Pooling method | Mean pooling
Suitable score functions | dot-product (util.dot_score), cosine-similarity (util.cos_sim), or euclidean distance

Note: when loaded with sentence-transformers, this model produces normalized embeddings of length 1. In that case, dot-product and cosine-similarity are equivalent; dot-product is preferred as it is faster. Euclidean distance is proportional to dot-product and can also be used.
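As a quick sanity check, the snippet below (a minimal sketch, not part of the original examples) verifies that the embeddings have unit length and that util.dot_score and util.cos_sim therefore return the same value:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

query_emb = model.encode(["How many people live in London?"], convert_to_tensor=True)
doc_emb = model.encode(["Around 9 Million people live in London"], convert_to_tensor=True)

# The embeddings are L2-normalized, so their norms are (close to) 1.0
print(query_emb.norm(dim=1), doc_emb.norm(dim=1))

# For unit-length vectors, dot-product and cosine-similarity coincide
print(util.dot_score(query_emb, doc_emb))
print(util.cos_sim(query_emb, doc_emb))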

Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We use a contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
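For illustration only, here is a minimal PyTorch sketch of such a contrastive objective with in-batch negatives; the tensor shapes and the scale value are placeholders, not the exact training configuration:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, answer_emb, scale=20.0):
    # Similarity matrix between every query and every answer in the batch;
    # the true (question, answer) pairs sit on the diagonal.
    scores = query_emb @ answer_emb.t() * scale
    labels = torch.arange(query_emb.size(0))
    # Cross-entropy pushes each query towards its paired answer and away
    # from the other (randomly sampled) answers in the batch.
    return F.cross_entropy(scores, labels)

# Toy example with random, L2-normalized 384-dimensional embeddings
q = F.normalize(torch.randn(8, 384), dim=1)
a = F.normalize(torch.randn(8, 384), dim=1)
print(in_batch_contrastive_loss(q, a))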

We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8), as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

Intended Use

Our model is intended to be used for semantic search: it encodes queries / questions and text passages into a dense vector space and retrieves the passages that are relevant to a given query.

Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was only trained on input text of up to 250 word pieces; it might not work well for longer text.
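To see how long a passage is in word pieces before encoding it, you can run it through the tokenizer yourself. The snippet below is a small sketch using the model's own tokenizer; the long text is just a placeholder:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

long_text = "Around 9 Million people live in London. " * 100  # placeholder long passage
print(len(tokenizer(long_text)["input_ids"]))  # number of word pieces before truncation

# Explicit truncation to 512 word pieces, mirroring what the encode functions above do
encoded = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([1, 512])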

Training Procedure

The full training script is available in the current repository: train_script.py.

Pre-training

We use the pretrained nreimers/MiniLM-L6-H384-uncased model. Please refer to its model card for more detailed information about the pre-training procedure.
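A sentence-transformers model built on this base checkpoint with the mean pooling and normalization settings from the table above could be assembled roughly like this; this is a sketch for illustration, and train_script.py remains the authoritative setup:

from sentence_transformers import SentenceTransformer, models

# Transformer backbone: the pretrained MiniLM checkpoint
word_embedding_model = models.Transformer('nreimers/MiniLM-L6-H384-uncased', max_seq_length=512)

# Mean pooling over the contextualized word embeddings (384 dimensions)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')

# L2-normalization so that dot-product and cosine-similarity coincide
normalize_model = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize_model])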

Training

We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs. We sampled each dataset with a weighted probability; the configuration is detailed in the data_config.json file.

The model was trained with MultipleNegativesRankingLoss, using mean pooling, cosine similarity as the similarity function, and a scale of 20 (a minimal sketch of this setup follows the table below).

Dataset | Number of training tuples
Duplicate question pairs from WikiAnswers | 77,427,422
Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441
(Title, Body) pairs from all StackExchanges | 25,316,456
(Title, Answer) pairs from all StackExchanges | 21,396,559
Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773
(query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496
(Question, Answer) pairs from Amazon product pages | 2,448,839
(Title, Answer) pairs from Yahoo Answers | 1,198,260
(Question, Answer) pairs from Yahoo Answers | 681,164
(Title, Question) pairs from Yahoo Answers | 659,896
(Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261
(Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475
Duplicate questions pairs (titles) | 304,525
(Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663
(Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231
(Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599
(Question, Evidence) pairs | 73,346
Total | 214,988,242
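For illustration, a fine-tuning setup along the lines described above might look as follows; the two (question, answer) pairs are placeholders, and the real data pipeline and hyperparameters live in train_script.py and data_config.json:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Base checkpoint (sentence-transformers adds mean pooling automatically when
# loading a plain transformers model); used here only for illustration.
model = SentenceTransformer('nreimers/MiniLM-L6-H384-uncased')

train_examples = [
    InputExample(texts=["How many people live in London?", "Around 9 Million people live in London"]),
    InputExample(texts=["What is London known for?", "London is known for its financial district"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss with cosine similarity and a scale of 20,
# matching the settings stated above
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)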