Announcing BM25S, a fast lexical retrieval library.
Up to 500x faster than the most popular Python library, and it matches @Elastic search results (with BM25 defaults).
First BM25 library that is directly integrated with @huggingface hub: load or save in 1 line!
With Python-based implementations like BM25S and Rank-BM25, you can tokenize your text, index, and retrieve in ~10 lines of code.
However, a straightforward NumPy implementation can't match the speed of Java-based implementations.
BM25S is different: it eagerly computes BM25 scores at indexing time and stores them in sparse scipy matrices, so a query reduces to slicing and summing precomputed columns.
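A minimal sketch of that eager-scoring idea (the toy tokenization, helper names, and parameters below are my own illustration, not BM25S internals):

```python
import numpy as np
from scipy.sparse import csc_matrix

def build_eager_bm25(corpus_tokens, k1=1.5, b=0.75):
    # Vocabulary over the whole corpus
    vocab = {t: i for i, t in enumerate(sorted({t for doc in corpus_tokens for t in doc}))}
    n_docs = len(corpus_tokens)
    lengths = np.array([len(doc) for doc in corpus_tokens], dtype=float)
    avgdl = lengths.mean()

    # Document frequency per token, then a Lucene-style IDF
    df = np.zeros(len(vocab))
    for doc in corpus_tokens:
        for t in set(doc):
            df[vocab[t]] += 1
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1)

    # Eagerly compute one BM25 contribution per (doc, token) pair
    rows, cols, vals = [], [], []
    for d, doc in enumerate(corpus_tokens):
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        for t, f in tf.items():
            j = vocab[t]
            score = idf[j] * f * (k1 + 1) / (f + k1 * (1 - b + b * lengths[d] / avgdl))
            rows.append(d); cols.append(j); vals.append(score)
    return csc_matrix((vals, (rows, cols)), shape=(n_docs, len(vocab))), vocab

corpus = [["cat", "feline", "purr"], ["dog", "canine", "bark"]]
scores, vocab = build_eager_bm25(corpus)

# At query time there is no BM25 math left: just sum the precomputed columns
q = ["cat", "purr"]
query_scores = np.asarray(scores[:, [vocab[t] for t in q if t in vocab]].sum(axis=1)).ravel()
top = int(np.argmax(query_scores))  # index of the best-matching document
```

Because the per-token scores are stored in a sparse matrix, scoring a query is a handful of sparse column lookups rather than a pass over the corpus.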
import bm25s
from bm25s.hf import BM25HF
# Create a BM25 retriever and index the corpus
corpus = [
"a cat is a feline and likes to purr", # ...
]
retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))
# Save the index, config, and corpus to the 🤗 Hugging Face Hub in 1 line!
retriever.save_to_hub("user/bm25s-index", corpus=corpus)
# Load any bm25s-index from 🤗 Hugging Face Hub
retriever = BM25HF.load_from_hub("user/bm25s-index")
Beyond that, BM25S can memory-map the index instead of loading everything into RAM, which substantially reduces memory usage.
This lets you query millions of documents in real time on a single CPU thread.
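The memory-mapping idea can be illustrated with NumPy's memmap support (a generic sketch of the principle, not BM25S's on-disk format; BM25S exposes this via an `mmap=True` option when loading an index):

```python
import os
import tempfile
import numpy as np

# Write a large score array to disk once...
path = os.path.join(tempfile.mkdtemp(), "scores.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32))

# ...then open it memory-mapped: the OS pages in only the slices you touch,
# so resident memory stays small even for multi-GB index files.
mm = np.load(path, mmap_mode="r")
chunk = mm[500_000:500_010]  # reads only a few bytes, not the whole file
```

The same trick is what lets a retriever serve queries over an index far larger than available RAM.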
Here’s a collection of indices for public BEIR datasets on the Hub: “BM25S Indices” (a xhluca collection).
That said, BM25S stands on the shoulders of giants: rank-bm25 (the first Python implementation), Pyserini, and bm25-pt (which inspired this project). Building BM25S was only possible thanks to those implementations!