Announcing BM25S, a super-fast lexical search library integrated with the Hugging Face Hub

Announcing :zap: BM25S, a fast lexical retrieval library.

:racing_car: Up to 500x faster than the most popular Python lib, and matches @Elastic search results (with BM25 defaults)
:hugs: First BM25 library that is directly integrated with @huggingface hub: load or save in 1 line!

With Python-based implementations like BM25S and Rank-BM25, you can tokenize your text, then index and retrieve, in ~10 lines of code.

However, simply implementing BM25 with Numpy may not achieve the same speed as Java-based implementations.

BM25S is different: at indexing time, it eagerly computes the BM25 score of every token against every document and stores those scores in a scipy sparse matrix, so answering a query reduces to slicing and summing rows.
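To make the eager-scoring idea concrete, here is a minimal sketch (hypothetical helper names, not the actual bm25s internals): all (token, document) BM25 scores are precomputed into a sparse matrix once, and a query is answered by summing the rows of its tokens.

```python
# Sketch of eager BM25 scoring with a scipy sparse matrix.
# These helpers are illustrative only, NOT the bm25s API.
import numpy as np
from scipy.sparse import csr_matrix

def build_score_matrix(docs, k1=1.5, b=0.75):
    """Precompute BM25 scores into a (vocab x docs) sparse matrix."""
    vocab, rows, cols, tfs = {}, [], [], []
    lengths = np.array([len(d) for d in docs], dtype=float)
    avgdl = lengths.mean()
    for j, doc in enumerate(docs):
        counts = {}
        for tok in doc:
            counts[tok] = counts.get(tok, 0) + 1
        for tok, tf in counts.items():
            i = vocab.setdefault(tok, len(vocab))
            rows.append(i); cols.append(j); tfs.append(tf)
    n = len(docs)
    df = np.zeros(len(vocab))          # document frequency per token:
    for i in rows:                     # one (token, doc) entry per doc,
        df[i] += 1                     # so this counts containing docs
    idf = np.log(1 + (n - df + 0.5) / (df + 0.5))
    tfs = np.array(tfs, dtype=float)
    norm = k1 * (1 - b + b * lengths[cols] / avgdl)
    scores = idf[rows] * tfs * (k1 + 1) / (tfs + norm)
    return csr_matrix((scores, (rows, cols)), shape=(len(vocab), n)), vocab

def retrieve(matrix, vocab, query_tokens, k=1):
    """Query = sum the precomputed rows of the query tokens."""
    rows = [vocab[t] for t in query_tokens if t in vocab]
    totals = np.asarray(matrix[rows].sum(axis=0)).ravel()
    return np.argsort(-totals)[:k]

docs = [["cat", "purr"], ["dog", "bark"]]
matrix, vocab = build_score_matrix(docs)
print(retrieve(matrix, vocab, ["cat"]))  # doc 0 ranks first
```

Because the heavy lifting happens once at index time, query time avoids recomputing term frequencies or IDF entirely.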

import bm25s
from bm25s.hf import BM25HF

# Create a BM25 retriever and index the corpus
corpus = [
    "a cat is a feline and likes to purr", # ...
]
retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

# Save index, config and corpus to the 🤗 Hugging Face Hub in 1 line!
retriever.save_to_hub("user/bm25s-index", corpus=corpus)

# Load any bm25s-index from 🤗 Hugging Face Hub
retriever = BM25HF.load_from_hub("user/bm25s-index")

Beyond that, it supports memory-mapping the index instead of loading everything into memory, which substantially reduces RAM usage.

This allows you to query across millions of documents in real time on a single CPU thread.

Here’s a collection of indices for public BEIR datasets: BM25S Indices - a xhluca Collection

That said, BM25S stands on the shoulders of giants: rank-bm25 (the first Python implementation), Pyserini, and bm25-pt (which inspired this project). Building BM25S was only possible thanks to those implementations!