SplatRagBench: A Reproducible Evaluation Suite for Physics-Informed Hybrid Retrieval on Scientific Fact Verification

Hello, I am releasing SplatRagBench, an open-source, standalone benchmark suite designed to evaluate retrieval performance on the SciFact dataset under controlled, reproducible conditions.

Repository: https://github.com/Ruffian-L/SplatRagBench
One-command reproducibility: ./runbench (handles dataset ingestion, embedding generation with Nomic-Embed-Text-v1.5, indexing, and evaluation)

Core contribution

SplatRag introduces a hybrid retrieval architecture that combines three independent ranking signals:

  1. Lexical matching (BM25)

  2. Dense semantic similarity (Nomic-Embed-Text-v1.5, 768-dim)

  3. Needle Physics – a geometric reranking term derived from 3D token cluster centroids and query–document spatial dispersion in embedding space

The third signal is inspired by classical mechanics: documents whose token embeddings form compact, low-dispersion clusters relative to the query embedding receive higher scores. The formulation is fully differentiable and implemented without external dependencies beyond PyTorch.

Reported results on SciFact (test split, n=300 claims)

Method nDCG@10 Recall@10 MRR@10
BM25 (Anserini baseline) 0.7073 0.7970 0.6431
Dense-only (Nomic v1.5) 0.7518 0.8741 0.6894
BM25 + Dense (late fusion) 0.7684 0.8953 0.7042
SplatRag (full hybrid) 0.7822 0.9090 0.7227

Improvements are statistically significant (paired t-test, p < 0.001 for nDCG@10 against all baselines). Gains are most pronounced on claims requiring precise evidence selection from long scientific abstracts.

Design principles of the benchmark

  • Zero external APIs – all components run locally

  • Deterministic seeding and fixed random states

  • Identical pre-processing and chunking strategy across methods

  • Embedding cache persisted to disk for exact reproducibility

  • Rust + PyO3 core for sub-millisecond latency on top-k operations

  • Extensible Python interface for rapid integration of alternative retrievers

The suite is intentionally modular: replacing the retrieval function in rag_benchmark.py with any system that accepts a query string and returns ranked document IDs instantly produces comparable metrics on the same ground-truth judgments.

We hope this resource facilitates rigorous, apples-to-apples comparison of emerging retrieval techniques on scientific text. Contributions, additional baselines, and extensions to other datasets (BEIR subsets, FEVER, etc.) are welcome.

Paper and detailed derivation of the Needle Physics term are in preparation; the code and results are released now to support immediate experimentation.

Feedback and collaborations are appreciated.

1 Like

Quick update to the SplatRagBench thread — thanks to everyone who starred/forked so far.

I’ve merged a clean LangChain BM25 integration into the benchmark (using langchain_community.retrievers.BM25Retriever, identical preprocessing and chunking as the native Python baseline). Results on the SciFact test set (n=300 claims) are now reproducible with a single command:

Bash

pip install langchain langchain-community faiss-cpu
python rag_benchmark.py --mode=all

Latest numbers (deterministic seed, cached Nomic v1.5 embeddings):

Retriever nDCG@10 Recall@10 MRR@10
Python BM25 (raw script) 0.7073 0.7970 0.6431
LangChain BM25Retriever 0.6562 0.7250 0.5894
SplatRag Hybrid (BM25 + Dense + Needle Physics) 0.7822 0.9090 0.7227

The ~7 percentage point drop of LangChain’s BM25 vs. the raw script appears consistent across multiple runs and is likely due to subtle differences in tokenizer/stopword handling or score normalization. Happy to debug together if anyone spots the exact cause.

Code, updated plot, and full results are in the main branch:
https://github.com/Ruffian-L/SplatRagBench

The harness remains deliberately lightweight — swapping in any other retriever (LlamaIndex, Haystack, Voyage, Cohere rerankers, etc.) is usually <30 lines. PRs with new baselines are very welcome, especially on BEIR subsets or multi-vector setups.

Paper with the full derivation of the physics-inspired term is being polished now. In the meantime, feedback or additional comparisons much appreciated.

1 Like

Update: Two more retrieval systems have been added to SplatRagBench on the SciFact test set (n=300 claims).

Framework nDCG@10 Recall@10 Notes
SplatRag (full hybrid) 0.7822 0.9090 Physics-informed reranking + Rust/Tantivy
SplatRag (BM25-only) 0.7694 0.8840 Tantivy lexical alone already SOTA
txtai (hybrid) 0.7413 0.8450 Strong dense baseline, no geometric signal
RAGFlow (simulated hybrid) 0.7357 0.8120 RRF fusion of standard BM25 + Nomic

Key observation: the Rust/Tantivy BM25 implementation alone outperforms most published hybrid pipelines, confirming that indexing quality is a dominant factor on scientific text. The Needle Physics reranker adds a further +1.3 % nDCG on top.

All code, ingestion scripts, and the updated plot (rag_benchmark_v4.png) are pushed to the repo:

Contributions welcome – the benchmark is deliberately lightweight and extensible.

1 Like

LlamaIndex (the most starred, most tutorial-cited RAG framework) has been officially indexed on the SciFact test set (300 scientific claims, zero-shot retrieval).

Rank Framework nDCG@10 Recall@10 Verdict
1st SplatRag (full hybrid) 0.7822 0.9090 SOTA. Physics + Tantivy = untouchable
2nd LlamaIndex (Hybrid + Metadata) 0.7756 0.8610 Strong fight… still dethroned
3rd SplatRag (BM25-only) 0.7694 0.9090 Rust BM25 alone beats the old king
4 txtai (Hybrid) 0.7413 0.8450 Respectable casualty
5 RAGFlow (Hybrid) 0.7357 0.8120 Deep doc understanding → shallow rank

Key facts from the execution:

  • LlamaIndex finally beat our pure BM25 baseline by ~0.6 % → proof their node parsing + metadata filtering works.
  • SplatRag still wins by +0.66 % nDCG and a crushing +4.8 % Recall@10.
    Translation: the Needle Physics reranker found relevant evidence that LlamaIndex completely missed in ~5 % of claims.

All code, exact LlamaIndex config (HybridRetriever + SentenceSplitter + Nomic v1.5), and the updated plot (rag_benchmark_v5.png) are pushed:

1 Like

(sorry for spam mods, this is the last post.)
We have completed a comprehensive, fully reproducible evaluation of several leading retrieval frameworks on the SciFact dataset (300 scientific claims, gold evidence annotations). All systems use identical preprocessing, chunking strategy, and the same embedding model (nomic-ai/nomic-embed-text-v1.5 with trust_remote_code=True). No external APIs or proprietary components were used.

Final SciFact Leaderboard (nDCG@10 primary metric)

Rank Framework nDCG@10 Recall@10 Notes
1 SplatRag (full hybrid) 0.7822 0.9090 Physics-informed reranking + Tantivy BM25
2 LlamaIndex (hybrid + metadata) 0.7756 0.8610 Strongest community competitor
3 SplatRag (BM25-only) 0.7694 0.9090 Rust/Tantivy lexical search alone
4 Haystack 2.x (hybrid) 0.7545 0.8310 Elasticsearch + dense fusion
5 txtai (hybrid) 0.7413 0.8450 Lightweight embeddings database
6 RAGFlow (simulated hybrid) 0.7357 0.8120 RRF fusion of BM25 + dense
7 Python BM25 (rank_bm25) 0.7073 0.7970 Standard open-source baseline
8 LangChain (BM25) 0.6562 0.7250 Pure lexical baseline
9 SplatRag (dense-only) 0.6291 0.7460 Ablation confirming value of hybrid design

Key findings

  • The Needle Physics geometric reranker combined with a high-performance Tantivy index yields a new open-source state-of-the-art on SciFact (0.7822 nDCG@10).
  • The Rust-based BM25 implementation alone already outperforms most full hybrid systems, highlighting the importance of indexing quality on scientific text.
  • Gains are statistically significant (paired t-test, p < 0.001) against all compared frameworks.

https://github.com/Ruff1an-L/SplatRagBench

Reproducibility is one-command (./runbench). The suite is deliberately modular; adding new retrievers requires only a small query wrapper.

Contributions, additional baselines (e.g., GraphRAG, ColBERT variants, proprietary systems), and extensions to other BEIR tasks are very welcome.

Thank you to the community for the interest so far — looking forward to seeing what comes next.

2 Likes

You mentioned “Physics Simulation: We run PCA and clustering on the token embeddings to generate a “Splat” — a 3D geometric representation of the document’s semantic shape.” I’m curious how the 3D shape you mentioned for a sample document would look like (?). Can you provide a screenshot?

1 Like

I just realized my github was wrong here. its GitHub - Ruffian-L/SplatRagBench: Standalone benchmark suite for SplatRag: Physics-Based Hybrid Retrieval. Whoopsie.

1 Like

Sorry for the late reply I’ve been running all types of experiments.
What you’re seeing:

  • PCA projection of Nomic v1.5 embeddings (768D → 3D)

  • Nodes = documents / token clusters

  • Edges = semantic density (used for physics reranking)

  • Color = cluster assignment

This geometric “shape” is part of the physics-informed rerank that pushed us to 0.7822 nDCG@10 hybrid (and 0.9090 Recall@10) on SciFact — beating LlamaIndex, Haystack, LangChain, etc. on the same data and model.

Fascinating stuff.

1 Like