[Tool Release] distill_rag — clean HTML extraction → long-chunking → embeddings → Elasticsearch search

htaf · November 20, 2025, 3:28am

Hi everyone,
I’m sharing a small open-source toolkit I built for people working with RAG pipelines or dataset distillation.

distill_rag provides the early stages of a distillation workflow:

Clean HTML extraction (scripts/ads/nav/boilerplate removed)
Structured {title, turns} session conversion
Long-chunking tuned for distillation (5–9k chars)
Local embeddings (Ollama-compatible)
Elasticsearch v8 vector index creation
Simple semantic search API (BM25, vector, hybrid)
Full test suite + CLI
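
To make the long-chunking step concrete, here is a minimal sketch of one way to pack paragraphs into chunks in the 5–9k character range. It is not taken from the distill_rag source: the function name `chunkParagraphs`, its options, and the greedy packing strategy are all assumptions for illustration.

```javascript
// Hypothetical helper, not distill_rag's actual code.
// Greedily packs paragraphs into chunks that never exceed maxChars.
// minChars is only used to decide whether a short final chunk should be
// merged into the previous one, so earlier chunks can still end up shorter.
function chunkParagraphs(text, { minChars = 5000, maxChars = 9000 } = {}) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks = [];
  let current = "";

  for (const para of paragraphs) {
    // Hard-split any single paragraph that is longer than maxChars.
    const pieces = [];
    for (let i = 0; i < para.length; i += maxChars) {
      pieces.push(para.slice(i, i + maxChars));
    }

    for (const piece of pieces) {
      const candidate = current ? current + "\n\n" + piece : piece;
      if (candidate.length <= maxChars) {
        current = candidate; // still fits, keep growing this chunk
      } else {
        chunks.push(current); // flush and start a new chunk
        current = piece;
      }
    }
  }

  if (current) {
    const last = chunks[chunks.length - 1];
    const fitsInLast =
      last !== undefined && last.length + 2 + current.length <= maxChars;
    if (current.length < minChars && fitsInLast) {
      // Merge a short tail into the previous chunk when there is room.
      chunks[chunks.length - 1] = last + "\n\n" + current;
    } else {
      chunks.push(current);
    }
  }
  return chunks;
}
```

Calling `chunkParagraphs(sessionText)` returns an array of strings. Splitting on blank lines keeps paragraphs intact wherever possible, which tends to matter more for distillation than hitting an exact length.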

The goal is to stay lightweight, transparent, and easy to extend for research archives, spiritual transcripts, or long-form Q&A datasets.
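
For the indexing and search stages in the list above, the request bodies might look roughly like the sketch below. The field names (`title`, `text`, `embedding`), the helper names, and the `num_candidates` heuristic are assumptions, not distill_rag's actual schema. The `dense_vector` mapping and the top-level `knn` search option are standard Elasticsearch 8 features.

```javascript
// Hypothetical request-body builders; field names are illustrative only.

// Body for creating an index (PUT /<index>). `dims` must match the
// output size of whichever embedding model is used.
function buildIndexBody(dims) {
  return {
    mappings: {
      properties: {
        title: { type: "text" },
        text: { type: "text" },
        embedding: {
          type: "dense_vector",
          dims,
          index: true,
          similarity: "cosine",
        },
      },
    },
  };
}

// Body for POST /<index>/_search in one of three modes.
function buildSearchBody(mode, queryText, queryVector, k = 10) {
  const bm25 = { match: { text: queryText } };
  const knn = {
    field: "embedding",
    query_vector: queryVector,
    k,
    num_candidates: k * 10, // assumed heuristic, tune for recall vs. speed
  };

  if (mode === "bm25") return { size: k, query: bm25 };
  if (mode === "vector") return { size: k, knn };
  // Hybrid: Elasticsearch combines the scores of `query` and `knn` hits.
  if (mode === "hybrid") return { size: k, query: bm25, knn };
  throw new Error(`unknown mode: ${mode}`);
}
```

Keeping these as plain objects means they can be sent with any HTTP client or passed to the official Elasticsearch client.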

Why It Stands Out: Speed & Simplicity

Built in Node.js with async promises, it’s fast: often 5–10× quicker than Python tools like LlamaIndex or LangChain. It processes thousands of chunks in minutes on consumer GPUs (e.g., 3,565 chunks in 1m48s on an RTX 3090, about 33 chunks/s). The embedding step is GPU-bound, while extraction, chunking, and indexing add very little overhead. A good fit for JS devs or quick local setups!
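
As an illustration of the async-promise approach, a concurrency-limited embedding loop could look like the sketch below. Both helpers are hypothetical and not copied from distill_rag. The second one assumes a local Ollama server at its default port, with `nomic-embed-text` as an example model name.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight at once, preserving input order in the results.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    // Each runner pulls the next unclaimed index until none remain.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runners = Array.from(
    { length: Math.min(limit, items.length) },
    () => runner()
  );
  await Promise.all(runners);
  return results;
}

// Assumed embedding call against a local Ollama server (Node 18+ for fetch).
// Model name and base URL are placeholders to adjust for your setup.
async function embedWithOllama(
  text,
  model = "nomic-embed-text",
  baseUrl = "http://localhost:11434"
) {
  const res = await fetch(`${baseUrl}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt: text }),
  });
  if (!res.ok) throw new Error(`embedding request failed: ${res.status}`);
  const { embedding } = await res.json();
  return embedding;
}
```

A call such as `mapWithConcurrency(chunks, 4, (c) => embedWithOllama(c))` keeps a few requests queued so the GPU is not left idle between calls. The right limit depends on the model and hardware.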

HF Space: distill-rag (a Hugging Face Space by htaf)
GitHub: elspru/distill_rag. Semantic chunking, embedding, and Elasticsearch indexing pipeline. Includes dataset extractors, data cleaners, a modular embedding layer, and an automated test suite.

If this helps your own RAG or dataset-cleaning workflow, I’d love to hear feedback or suggestions.
Thanks for reading!

htaf · November 20, 2025, 12:32pm

Benchmark Example: On an RTX 3090 with 1,531 sessions (~3,565 chunks of 5,000–9,000 chars each), a full index rebuild takes just 1m48s (about 33 chunks/s). Comparable Python tools often take 8–40 minutes for similar workloads due to wrapper latencies and inefficient batching.
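
The throughput figure can be checked from the numbers above. The helper names below are made up for this example, and the inputs are the figures quoted in this post.

```javascript
// Hypothetical helpers for checking the benchmark arithmetic.

// Parse durations written like "1m48s", "48s" or "8m" into seconds.
function parseDuration(s) {
  const m = /^(?:(\d+)m)?(?:(\d+)s)?$/.exec(s);
  if (!m || (m[1] === undefined && m[2] === undefined)) {
    throw new Error(`cannot parse duration: ${s}`);
  }
  return Number(m[1] ?? 0) * 60 + Number(m[2] ?? 0);
}

// Chunks processed per second for a given run.
function throughput(chunkCount, duration) {
  return chunkCount / parseDuration(duration);
}

console.log(parseDuration("1m48s"));                // 108
console.log(throughput(3565, "1m48s").toFixed(1));  // 33.0

// Ratio of the quoted 8–40 minute range to the 1m48s run.
console.log((parseDuration("8m") / parseDuration("1m48s")).toFixed(1));  // 4.4
console.log((parseDuration("40m") / parseDuration("1m48s")).toFixed(1)); // 22.2
```

So 3,565 chunks in 108 seconds is about 33 chunks/s, and the 8–40 minute range corresponds to roughly 4.4× to 22× the 1m48s run.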
