[Tool Release] distill_rag — clean HTML extraction → long-chunking → embeddings → Elasticsearch search

Hi everyone,
I’m sharing a small open-source toolkit I built for people working with RAG pipelines or dataset distillation.

distill_rag provides the early stages of a distillation workflow:

  • Clean HTML extraction (scripts/ads/nav/boilerplate removed)
  • Structured {title, turns} session conversion
  • Long-chunking tuned for distillation (5–9k chars)
  • Local embeddings (Ollama-compatible)
  • Elasticsearch v8 vector index creation
  • Simple semantic search API (BM25, vector, hybrid)
  • Full test suite + CLI
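
For readers wondering what the long-chunking step might look like, here is a minimal sketch. It is not distill_rag's actual code: `chunkSession`, `MIN_CHARS`, and `MAX_CHARS` are hypothetical names, and it assumes each entry in `turns` is a plain string.

```javascript
// Hypothetical sketch: pack a session's turns into chunks of roughly 5,000-9,000 chars.
// Names and the "turns are plain strings" shape are assumptions, not distill_rag's API.
const MIN_CHARS = 5000;
const MAX_CHARS = 9000;

function chunkSession(session, minChars = MIN_CHARS, maxChars = MAX_CHARS) {
  const chunks = [];
  let buffer = "";

  const flush = () => {
    if (buffer.length > 0) {
      chunks.push({ title: session.title, text: buffer });
      buffer = "";
    }
  };

  for (const turn of session.turns) {
    // Hard-split any single turn that is longer than maxChars on its own.
    for (let i = 0; i < turn.length; i += maxChars) {
      const piece = turn.slice(i, i + maxChars);
      const joined = buffer ? buffer + "\n\n" + piece : piece;
      if (joined.length > maxChars) {
        // The piece does not fit: close the current chunk (it may be under minChars).
        flush();
        buffer = piece;
      } else {
        buffer = joined;
      }
      // Close the chunk once it has reached the minimum target size.
      if (buffer.length >= minChars) flush();
    }
  }
  flush(); // The final chunk may also be shorter than minChars.
  return chunks;
}
```

No chunk exceeds `maxChars`, and turns are kept whole unless a single turn is longer than the limit. A chunk can still fall below `minChars` when the next turn would not fit, or at the end of a session.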

The goal is to stay lightweight, transparent, and easy to extend for research archives, spiritual transcripts, or long-form Q&A datasets.
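
To illustrate the three search modes in the feature list above (BM25, vector, hybrid), here is a rough sketch of the request bodies one could send to Elasticsearch's `_search` endpoint. The field names `text` and `embedding` and the helper `buildSearchBody` are assumptions for illustration, not distill_rag's real schema or API. It also assumes a recent Elasticsearch 8.x release that accepts a top-level `knn` section alongside `query`.

```javascript
// Hypothetical field names ("text", "embedding"); adjust to your own index mapping.
function buildSearchBody(mode, queryText, queryVector, k = 10) {
  // Lexical (BM25) part: a plain match query on the chunk text.
  const bm25 = { match: { text: queryText } };
  // Vector part: approximate kNN over a dense_vector field.
  const knn = {
    field: "embedding",
    query_vector: queryVector,
    k,
    num_candidates: k * 10, // assumed oversampling factor, tune for your data
  };

  if (mode === "bm25") return { size: k, query: bm25 };
  if (mode === "vector") return { size: k, knn };
  if (mode === "hybrid") return { size: k, query: bm25, knn };
  throw new Error(`Unknown search mode: ${mode}`);
}
```

The returned object is only the JSON body, so it can be passed to whichever HTTP or Elasticsearch client you already use.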

⚡ Why It Stands Out: Speed & Simplicity

Built in Node.js with async promises, it’s fast: often 5–10× quicker than Python tools like LlamaIndex or LangChain. It can process thousands of chunks in minutes on a consumer GPU (e.g., 3,565 chunks in 1m48s on an RTX 3090, about 33 chunks/s). Embedding is GPU-bound, but extraction, chunking, and indexing add little overhead. A good fit for JS devs or quick local setups!
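
As a rough illustration of how async promises can keep an embedding server busy, the sketch below sends batches with a bounded number of requests in flight. It is one possible pattern, not the toolkit's actual implementation: `embedAll`, `embedBatch`, `batchSize`, and `concurrency` are all hypothetical names, and `embedBatch` stands in for whatever function calls your embedding server.

```javascript
// Hypothetical sketch: embed texts in batches with bounded concurrency.
// embedBatch(batch) is assumed to return a promise of one vector per input text.
async function embedAll(texts, embedBatch, batchSize = 16, concurrency = 4) {
  const batches = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }

  const results = new Array(batches.length);
  let next = 0;

  // Each worker claims the next unprocessed batch until none are left.
  async function worker() {
    while (next < batches.length) {
      const index = next++;
      results[index] = await embedBatch(batches[index]);
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, batches.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results.flat(); // One vector per input text, in input order.
}
```

Because each result is stored by batch index, the output order matches the input order even when batches finish out of order.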

If this helps your own RAG or dataset-cleaning workflow, I’d love to hear feedback or suggestions.
Thanks for reading!

Benchmark Example: On an RTX 3090 with 1531 sessions (~3565 chunks, 5000–9000 chars each), a full index rebuild takes just 1m48s (33 chunks/s). Comparable Python tools often take 8–40 minutes for similar workloads due to wrapper latencies and inefficient batching.
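
The throughput figure can be checked with a few lines of arithmetic. The variable names below are illustrative, and the inputs are taken directly from the numbers quoted above.

```javascript
// Inputs quoted in the benchmark: 3565 chunks rebuilt in 1m48s.
const chunks = 3565;
const rebuildSeconds = 1 * 60 + 48; // 108 s
const chunksPerSecond = chunks / rebuildSeconds;
console.log(chunksPerSecond.toFixed(1)); // prints "33.0"

// Ratio of the quoted 8-40 minute range to the 108 s rebuild.
const slowdownLow = (8 * 60) / rebuildSeconds;
const slowdownHigh = (40 * 60) / rebuildSeconds;
console.log(slowdownLow.toFixed(1), slowdownHigh.toFixed(1)); // prints "4.4 22.2"
```

So the quoted 8–40 minutes corresponds to roughly 4× to 22× the rebuild time reported here.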
