Dataset:
Cohere/wikipedia-22-12-zh-embeddings
We encoded Wikipedia (zh) using the cohere.ai multilingual-22-12 embedding model.
For an overview of how this dataset was created and pre-processed, see Cohere/wikipedia-22-12.
We compute the embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. To learn more about this model, see cohere.ai multilingual embedding model.
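If you embed your own passages and want them to be comparable with this dataset, it probably helps to build the input string the same way. A minimal sketch, assuming the title and text are joined with exactly one space as described above; the helper name `build_embedding_input` is hypothetical and not part of any library:

```python
def build_embedding_input(title: str, text: str) -> str:
    """Join title and passage text with a single space (assumed input format)."""
    return title + " " + text


# Illustrative values only; they are not taken from the dataset.
example_input = build_embedding_input("YouTube", "YouTube is an online video sharing platform.")
print(example_input)
```

The resulting string is what you would pass in the `texts` list when requesting an embedding.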
We provide embeddings of Wikipedia in many different languages: ar, de, en, es, fr, hi, it, ja, ko, simple english, zh.
You can find the Wikipedia datasets without embeddings at Cohere/wikipedia-22-12.
You can either load the dataset like this:
from datasets import load_dataset
docs = load_dataset("Cohere/wikipedia-22-12-zh-embeddings", split="train")
Alternatively, you can stream it without downloading it first:
from datasets import load_dataset
docs = load_dataset("Cohere/wikipedia-22-12-zh-embeddings", split="train", streaming=True)
for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
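When streaming, you often want only the first few records rather than the whole split. A minimal sketch using `itertools.islice`, which works on any iterable; the helper name `take_first` is hypothetical, and the stand-in records below only mimic the field names (`id`, `title`, `text`, `emb`) with made-up values so the snippet runs offline:

```python
from itertools import islice


def take_first(stream, n):
    """Return the first n records of any iterable without reading the rest."""
    return list(islice(stream, n))


# Stand-in generator with the same field names as the dataset (values are made up).
fake_stream = (
    {"id": i, "title": f"title {i}", "text": f"text {i}", "emb": [0.0, 1.0]}
    for i in range(10)
)

head = take_first(fake_stream, 3)
print([doc["id"] for doc in head])
```

In practice you would pass the streamed dataset object in place of `fake_stream`.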
A full search example:
# Run: pip install cohere datasets torch
from datasets import load_dataset
import torch
import cohere
co = cohere.Client("<<COHERE_API_KEY>>")  # Add your Cohere API key from www.cohere.com
# Load at most 1000 documents and their embeddings
max_docs = 1000
docs_stream = load_dataset(f"Cohere/wikipedia-22-12-zh-embeddings", split="train", streaming=True)
docs = []
doc_embeddings = []
for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break
doc_embeddings = torch.tensor(doc_embeddings)
query = 'Who founded Youtube'
# Embed the query with the same model that produced the document embeddings
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)
# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)
# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
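To make the ranking step in the example easier to follow, here is a dependency-free sketch of what the dot-product scoring and top-k selection compute. It is an illustration on toy two-dimensional vectors, not the dataset's embeddings, and the helper name `top_k_by_dot` is hypothetical:

```python
def top_k_by_dot(query, doc_embeddings, k):
    """Score each document by its dot product with the query and return
    the k best (index, score) pairs, highest score first."""
    scores = [sum(q * d for q, d in zip(query, emb)) for emb in doc_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy vectors chosen so the arithmetic is easy to check by hand.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_query = [1.0, 0.25]

print(top_k_by_dot(toy_query, toy_docs, k=2))  # [(0, 1.0), (2, 0.625)]
```

The tensor version in the example performs the same scoring as a single matrix multiplication, which is much faster for thousands of documents.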
You can find performance on the MIRACL dataset (a semantic search evaluation dataset) here: miracl-en-queries-22-12#performance