Understanding Evaluation & Debugging Needs for LLM Pipelines - Introducing Vero (Early, Open Source, Looking for Critique)
Hey everyone, I’m working on evaluating failure modes in LLM pipelines (agents, RAG) and would love feedback from the community.
I’ve started building an early open-source tool called Vero that tests RAG and agent pipelines against real-world edge cases: it creates user personas from the business use case and generates test conversations from them.
The goal is to map where a pipeline breaks and suggest fixes (a rough sketch of the kind of workflow I mean is below the repo link). It’s still rough, and I’m trying to understand what the actual needs in this domain are.
Repo (pip install available): https://github.com/vero-labs-ai/vero-eval (open-source framework for evaluating AI agents)
Why I’m posting here:
I built it without talking much to users. I’d love to know what you think of it, and the one feature you’d most want to have; I’ll ship it.
I’m also trying to figure out whether this is genuinely valuable, or whether the real problems lie elsewhere.
Some specific questions you could answer:
- What are the most important evaluation signals in agentic or multi-step pipelines that are missing from current tools (Evals, Ragas, logging dashboards, etc.)?
- Should evaluation focus more on local correctness (step-level) or global reliability (task-level)?
- What evaluation tasks or benchmarks feel under-served right now?
Even blunt one-line responses help.
If you’ve tried to debug or evaluate complex LLM pipelines recently, I’d love to know what frustrated you the most.
Thanks in advance; I’ll refine Vero based on whatever I learn here.