Understanding Evaluation & Debugging Needs for LLM Pipelines - Introducing Vero (Early, Open Source, Looking for Critique)
Hey everyone, I’m working on evaluating failure modes in LLM pipelines (agents, RAG) and would love feedback from the community.
I’ve started building an early open-source tool called Vero that tests RAG and agent pipelines against real-world edge cases: it creates user personas from the business use case and generates test conversations from them.
The goal is to map where a pipeline breaks and suggest fixes (a rough sketch of the kind of workflow I mean is below the repo link). It’s still rough, and I’m trying to understand what the actual needs in this domain are.
Repo (pip install available): https://github.com/vero-labs-ai/vero-eval (open-source framework for evaluating AI agents)
Why I’m posting here:
I built it without talking much to users. I’d love to know what you think of it, and the one feature you’d most want to have; I’ll ship it.
I’m also trying to figure out whether this is genuinely valuable, or whether the real problems lie elsewhere.
Some specific questions you could answer:
- What are the most important evaluation signals in agentic or multi-step pipelines that are missing from current tools (Evals, Ragas, logging dashboards, etc.)?
- Should evaluation focus more on local correctness (step-level) or global reliability (task-level)?
- What evaluation tasks or benchmarks feel under-served right now?
Even blunt one-line responses help.
If you’ve tried to debug or evaluate complex LLM pipelines recently, I’d love to know what frustrated you the most.
Thanks in advance; I’ll refine Vero based on whatever I learn here.