Docling Studio 0.4.0 — from OCR debugger to RAG pipeline inspection tool

Hey everyone,

Just shipped Docling Studio 0.4.0 and wanted to share here since the project started getting traction on HF.

Quick recap: Docling Studio is a visual inspection tool for Docling (IBM Research / LF AI & Data). You convert a PDF, you see bounding boxes, chunks, layout — everything Docling extracts, rendered visually so you can actually debug what’s going on.

That part is still there and unchanged. But 0.4.0 adds something I’ve been working toward for a while: a full ingestion pipeline.

The flow is now: Docling → chunking → embedding (sentence-transformers) → OpenSearch. It runs end-to-end as one orchestrated pipeline, and re-ingestion is idempotent: running the same document through again doesn't pile up duplicate chunks.
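For anyone wondering what idempotent re-ingestion can look like in practice, here is a minimal sketch of one common approach: derive each chunk ID deterministically from the document content and the chunk position, then upsert. This is not Docling Studio's actual code. The `chunk_id` and `ingest` helpers are hypothetical names, and a plain dict stands in for the vector store.

```python
import hashlib


def chunk_id(doc_bytes: bytes, index: int) -> str:
    """Deterministic ID: same document + same position -> same ID."""
    doc_hash = hashlib.sha256(doc_bytes).hexdigest()[:16]
    return f"{doc_hash}-{index:05d}"


def ingest(store: dict, doc_bytes: bytes, chunks: list) -> None:
    """Upsert chunks into `store` (a dict standing in for a vector store).

    Re-running with the same input overwrites the same keys, and chunks
    left over from a previous, longer run of the same document are removed.
    """
    prefix = chunk_id(doc_bytes, 0).rsplit("-", 1)[0] + "-"
    new_ids = {chunk_id(doc_bytes, i) for i in range(len(chunks))}

    # Drop stale chunks belonging to this document only.
    stale = [key for key in store if key.startswith(prefix) and key not in new_ids]
    for key in stale:
        del store[key]

    # Upsert: writing to an existing key replaces it instead of duplicating.
    for i, text in enumerate(chunks):
        store[chunk_id(doc_bytes, i)] = text
```

With real OpenSearch the same idea would map to indexing each chunk under an explicit document ID, so that a second run overwrites instead of appending.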

Why does this matter? If you’re building RAG on top of Docling, at some point your retrieval gives bad results and you need to figure out why. Was the chunking wrong? Did a table get split across two chunks? Is there garbage text from a bad OCR region? Docling Studio now lets you visually inspect what’s actually in your vector store, edit chunk text inline, soft-delete chunks that shouldn’t be there, and search across indexed content.

A few things worth noting:

  • The whole ingestion pipeline is opt-in via feature flags. No OPENSEARCH_URL set → no ingestion UI, no extra dependencies, same lightweight image as before. People using it as a pure OCR debugger won’t notice any difference.

  • Architecture is hexagonal (ports & adapters). OpenSearch is the first VectorStore adapter. The port is a Python Protocol with 5 methods, so adding another store means writing one more adapter that implements them.

  • 541 tests (380 backend, 161 frontend) including Karate E2E tests covering the full PDF-to-OpenSearch flow.

  • Still ships as a single Docker image, multi-arch.
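To make the ports & adapters and feature-flag points above a bit more concrete, here is a rough sketch of what a five-method `VectorStore` port and a toy adapter could look like. The method names (`upsert`, `get`, `update_text`, `soft_delete`, `search`), the `Chunk` dataclass, `InMemoryVectorStore` and `ingestion_enabled` are all assumptions made up for illustration. The real port in the repo may look quite different. Only the `OPENSEARCH_URL` variable name comes from the project.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional, Protocol


@dataclass
class Chunk:
    id: str
    text: str
    embedding: list[float]
    deleted: bool = False  # soft-delete marker; the chunk is kept but hidden


class VectorStore(Protocol):
    """Hypothetical port: any class with these five methods satisfies it."""

    def upsert(self, chunks: list[Chunk]) -> None: ...
    def get(self, chunk_id: str) -> Optional[Chunk]: ...
    def update_text(self, chunk_id: str, text: str) -> None: ...
    def soft_delete(self, chunk_id: str) -> None: ...
    def search(self, query: str, limit: int = 10) -> list[Chunk]: ...


class InMemoryVectorStore:
    """Toy adapter for illustration; uses substring match, not vector search."""

    def __init__(self) -> None:
        self._chunks: dict[str, Chunk] = {}

    def upsert(self, chunks: list[Chunk]) -> None:
        for chunk in chunks:
            self._chunks[chunk.id] = chunk

    def get(self, chunk_id: str) -> Optional[Chunk]:
        return self._chunks.get(chunk_id)

    def update_text(self, chunk_id: str, text: str) -> None:
        self._chunks[chunk_id].text = text

    def soft_delete(self, chunk_id: str) -> None:
        self._chunks[chunk_id].deleted = True

    def search(self, query: str, limit: int = 10) -> list[Chunk]:
        needle = query.lower()
        hits = [
            c for c in self._chunks.values()
            if not c.deleted and needle in c.text.lower()
        ]
        return hits[:limit]


def ingestion_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Feature flag: ingestion is on only when OPENSEARCH_URL is set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("OPENSEARCH_URL"))
```

Because `Protocol` uses structural typing, the adapter does not need to inherit from `VectorStore`. Any class with matching methods can be passed where the port is expected, which is what keeps the core code independent of OpenSearch.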

You can try it right now:

docker pull ghcr.io/scub-france/docling-studio:0.4.0-remote

Or check the repo: github.com/scub-france/Docling-Studio

There’s also a demo on HF Spaces (OCR debug mode only, no ingestion there obviously): huggingface.co/spaces/Pier-Jean/Docling-Studio

Would love to hear feedback — especially from people building RAG pipelines with Docling. What vector store would you want to see next? What’s your biggest pain point when debugging retrieval quality?
