Hi everyone,
I’m sharing a small open-source toolkit I built for people working with RAG pipelines or dataset distillation.
distill_rag provides the early stages of a distillation workflow:
- Clean HTML extraction (scripts/ads/nav/boilerplate removed)
- Structured {title, turns} session conversion
- Long-chunking tuned for distillation (5–9k chars)
- Local embeddings (Ollama-compatible)
- Elasticsearch v8 vector index creation
- Simple semantic search API (BM25, vector, hybrid)
- Full test suite + CLI
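
As a rough illustration of the long-chunking step in the list above, here is a minimal sketch that greedily packs paragraphs into chunks of about 5–9k characters. It is not the code in the repo: `chunkText`, `minChars`, and `maxChars` are hypothetical names, and splitting on blank lines is an assumption about how the text is structured.

```javascript
// Hypothetical helper (not the toolkit's actual API): greedily packs
// blank-line-separated paragraphs into chunks of roughly minChars..maxChars.
function chunkText(text, { minChars = 5000, maxChars = 9000 } = {}) {
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks = [];
  let current = "";

  const flush = () => {
    if (current) chunks.push(current);
    current = "";
  };

  for (const para of paragraphs) {
    // A single paragraph longer than maxChars is hard-split at maxChars.
    if (para.length > maxChars) {
      flush();
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push(para.slice(i, i + maxChars));
      }
      continue;
    }
    const candidate = current ? current + "\n\n" + para : para;
    if (candidate.length > maxChars) {
      // Adding this paragraph would overflow, so start a new chunk with it.
      flush();
      current = para;
    } else {
      current = candidate;
    }
    // Close the chunk as soon as it reaches the lower bound.
    if (current.length >= minChars) flush();
  }
  flush(); // the final chunk may be shorter than minChars
  return chunks;
}
```

Calling `chunkText(sessionText)` returns an array of strings. Chunks that end a document, or that sit just before an oversized paragraph, can come out shorter than `minChars`.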
The goal is to stay lightweight, transparent, and easy to extend for research archives, spiritual transcripts, or long-form Q&A datasets.
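
For the hybrid search mode mentioned in the feature list, one common approach on Elasticsearch 8.x is to send a BM25 `match` query and a `knn` section in the same search request, which recent 8.x releases combine into a single ranking. The sketch below only builds the request body. `buildHybridSearch` is a hypothetical name, and the field names `text` and `embedding` are assumptions, not necessarily what distill_rag uses.

```javascript
// Hypothetical helper: builds an Elasticsearch 8.x search body that pairs
// a BM25 match query with an approximate kNN clause.
// Field names are assumptions; adjust them to the actual index mapping.
function buildHybridSearch(
  queryText,
  queryVector,
  { k = 10, textField = "text", vectorField = "embedding" } = {}
) {
  return {
    size: k,
    // Lexical side: standard full-text match, scored with BM25 by default.
    query: { match: { [textField]: queryText } },
    // Vector side: approximate nearest-neighbour search on a dense_vector field.
    knn: {
      field: vectorField,
      query_vector: queryVector,
      k,
      // Candidates considered per shard; larger is more accurate but slower.
      num_candidates: Math.max(100, k * 10),
    },
  };
}
```

Dropping the `knn` key gives a BM25-only request, and dropping `query` gives a vector-only one, so the same builder can cover all three modes.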
Why It Stands Out: Speed & Simplicity
Built on Node.js with promise-based async I/O, it is fast, often 5–10× quicker than Python tools like LlamaIndex or LangChain. It processes thousands of chunks in minutes on a consumer GPU (e.g., 3,565 chunks in 1 min 48 s on an RTX 3090, about 33 chunks/s). Embedding is the GPU-bound step; extraction, chunking, and indexing are quick, with no bloat. A good fit for JS devs or quick local setups!
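
To show what promise-based processing can look like when the embedding backend is the bottleneck, here is a minimal sketch of a bounded-concurrency map. It is an illustration under stated assumptions rather than the toolkit's implementation: `mapWithConcurrency` is a hypothetical helper, and `worker` stands in for whatever async function sends one chunk to the embedding server.

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight at once, and returns results in input order.
// Assumes limit >= 1.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner pulls the next unclaimed item until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With a hypothetical `embedChunk(text)` function, usage would be `await mapWithConcurrency(chunks, 4, embedChunk)`. Capping concurrency keeps a local embedding server from being flooded with requests while still overlapping network and GPU time.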
- HF Space: distill-rag (a Hugging Face Space by htaf)
- GitHub: elspru/distill_rag. Semantic chunking, embedding, and Elasticsearch indexing pipeline, with dataset extractors, data cleaners, a modular embedding layer, and an automated test suite.
If this helps your own RAG or dataset-cleaning workflow, I’d love to hear feedback or suggestions.
Thanks for reading!