Announcing BM25S, a fast lexical retrieval library.
Up to 500x faster than the most popular Python library, and it matches @Elastic search results (with BM25 defaults).
First BM25 library that is directly integrated with @huggingface hub: load or save in 1 line!
With Python-based implementations like BM25S and Rank-BM25, you can tokenize your text, index, and retrieve in ~10 lines of code.
However, a straightforward NumPy implementation can't match the speed of Java-based implementations.
BM25S is different: it eagerly computes BM25 scores at indexing time and stores them in sparse scipy matrices, so a query reduces to slicing and summing precomputed columns.
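A minimal sketch of that eager-scoring idea (the toy tokenization, helper names, and parameters below are my own illustration, not BM25S internals):

```python
import numpy as np
from scipy.sparse import csc_matrix

def build_eager_bm25(corpus_tokens, k1=1.5, b=0.75):
    # Vocabulary over the whole corpus
    vocab = {t: i for i, t in enumerate(sorted({t for doc in corpus_tokens for t in doc}))}
    n_docs = len(corpus_tokens)
    lengths = np.array([len(doc) for doc in corpus_tokens], dtype=float)
    avgdl = lengths.mean()

    # Document frequency per token, then a Lucene-style IDF
    df = np.zeros(len(vocab))
    for doc in corpus_tokens:
        for t in set(doc):
            df[vocab[t]] += 1
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1)

    # Eagerly compute one BM25 contribution per (doc, token) pair
    rows, cols, vals = [], [], []
    for d, doc in enumerate(corpus_tokens):
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        for t, f in tf.items():
            j = vocab[t]
            score = idf[j] * f * (k1 + 1) / (f + k1 * (1 - b + b * lengths[d] / avgdl))
            rows.append(d); cols.append(j); vals.append(score)
    return csc_matrix((vals, (rows, cols)), shape=(n_docs, len(vocab))), vocab

corpus = [["cat", "feline", "purr"], ["dog", "canine", "bark"]]
scores, vocab = build_eager_bm25(corpus)

# At query time there is no BM25 math left: just sum the precomputed columns
q = ["cat", "purr"]
query_scores = np.asarray(scores[:, [vocab[t] for t in q if t in vocab]].sum(axis=1)).ravel()
top = int(np.argmax(query_scores))  # index of the best-matching document
```

Because the per-token scores are stored in a sparse matrix, scoring a query is a handful of sparse column lookups rather than a pass over the corpus.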
import bm25s
from bm25s.hf import BM25HF
# Create a BM25 retriever and index the corpus
corpus = [
"a cat is a feline and likes to purr", # ...
]
retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))
# Save the index, config, and corpus to the 🤗 Hugging Face Hub in 1 line!
retriever.save_to_hub("user/bm25s-index", corpus=corpus)
# Load any bm25s-index from 🤗 Hugging Face Hub
retriever = BM25HF.load_from_hub("user/bm25s-index")
Beyond that, BM25S can memory-map the index instead of loading everything into RAM, which substantially reduces memory usage.
This lets you query millions of documents in real time on a single CPU thread.
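The memory-mapping idea can be illustrated with NumPy's memmap support (a generic sketch of the principle, not BM25S's on-disk format; BM25S exposes this via an `mmap=True` option when loading an index):

```python
import os
import tempfile
import numpy as np

# Write a large score array to disk once...
path = os.path.join(tempfile.mkdtemp(), "scores.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32))

# ...then open it memory-mapped: the OS pages in only the slices you touch,
# so resident memory stays small even for multi-GB index files.
mm = np.load(path, mmap_mode="r")
chunk = mm[500_000:500_010]  # reads only a few bytes, not the whole file
```

The same trick is what lets a retriever serve queries over an index far larger than available RAM.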
Here’s a collection of indices for public BEIR datasets on the Hub: “BM25S Indices” (a xhluca collection).
That said, BM25S stands on the shoulders of giants: rank-bm25 (the first Python implementation), Pyserini, and bm25-pt (which inspired this project). Building BM25S was only possible thanks to those implementations!