This is a txtai embeddings index for the English edition of Wikipedia.
This index is built from the OLM Wikipedia December 2022 dataset. Only the first paragraph of the lead section from each article is included in the index. This is similar to an abstract of the article.
It also uses Wikipedia Page Views data to add a percentile field to each entry. The percentile field can be used to restrict matches to commonly visited pages.
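As a rough sketch of how the percentile filter maps onto a query, the helper below turns a "top N%" cutoff into a `percentile >= ...` condition. The function name `top_fraction_query` and its parameters are hypothetical and not part of txtai; the column names (`id`, `text`, `score`, `percentile`) and the `similar()` clause are the ones this index is queried with.

```python
def top_fraction_query(topic: str, top_fraction: float = 0.01) -> str:
    """Build a txtai SQL query limited to the most visited pages.

    Hypothetical helper, not part of txtai. top_fraction=0.01 means
    "top 1% of pages by page views", i.e. percentile >= 0.99.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in the range (0, 1]")
    if "'" in topic:
        # Simplification for this sketch: reject quotes rather than escape them
        raise ValueError("topic must not contain single quotes")

    # Convert "top fraction" into a minimum percentile threshold
    threshold = round(1 - top_fraction, 4)

    return (
        "SELECT id, text, score, percentile FROM txtai "
        f"WHERE similar('{topic}') AND percentile >= {threshold}"
    )


query = top_fraction_query("Boston", top_fraction=0.01)
print(query)

# With a loaded index, the query string would be passed to search:
#   embeddings.search(query)
```

Keeping the cutoff as a fraction makes it easy to loosen the filter (for example `top_fraction=0.05`) without editing the SQL by hand.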
txtai must be installed to use this model.
txtai 5.4 added support for loading embeddings indexes from the Hugging Face Hub. See the example below.
```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Run a search
embeddings.search("Roman Empire")

# Run a search matching only the Top 1% of articles
embeddings.search("""
    SELECT id, text, score, percentile FROM txtai
    WHERE similar('Boston') AND percentile >= 0.99
""")
```
An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.
The Wikipedia index works well as a fact-based context source for conversational search. In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
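One way to assemble such a prompt is sketched below. It assumes search results arrive as a list of dictionaries with a `text` key, which is what a query selecting the `text` column would be expected to return. The function `build_prompt`, the prompt wording, and the sample rows are illustrative placeholders, not output from the index or part of txtai.

```python
def build_prompt(question: str, results: list[dict], limit: int = 3) -> str:
    """Format search results as context for an LLM prompt.

    Hypothetical helper: assumes each result is a dict with a "text" key.
    """
    # Keep only the first `limit` passages to bound the prompt length
    passages = [row["text"].strip() for row in results[:limit]]
    context = "\n".join(f"- {passage}" for passage in passages)

    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder rows standing in for real search results
sample_results = [
    {"id": "Example A", "text": "Placeholder abstract for article A.", "score": 0.9},
    {"id": "Example B", "text": "Placeholder abstract for article B.", "score": 0.8},
]

print(build_prompt("What is article A about?", sample_results))

# With a loaded index, the results would come from a search instead:
#   results = embeddings.search("Roman Empire", 3)
```

The resulting string can be sent to whichever LLM is in use; limiting the number of passages keeps the prompt within the model's context window.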
See this article for additional examples on how to use this model.