TRACE Score — a metric for multi-turn LLM consistency

Built a metric that evaluates the full conversation arc instead of individual turns.

BERTScore for a conversation where the model ignores every user correction: 0.84.
TRACE for the same conversation: 0.61.

TRACE has five components: fact retention, self-contradiction, correction retention, topic coherence, and confidence stability. I benchmarked it on 102 conversations with Llama-3.1-8B. Across failure categories, TRACE scores span a range of 0.277, while BERTScore scores span only 0.044. The model retains user corrections only 25% of the time. No per-turn metric can detect this, because each turn is scored in isolation.
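To make two of those numbers concrete, here is a rough sketch of how a correction-retention rate and a cross-category range could be computed. This is not the trace-score implementation. The function names, the substring check, and the example data are all hypothetical, and it assumes "range" means the highest per-category mean score minus the lowest.

```python
from statistics import mean


def correction_retention(corrections, later_replies):
    """Fraction of user corrections the model still honours afterwards.

    corrections   -- list of (wrong, right) string pairs supplied by the user
    later_replies -- list of lists: the model replies that followed each
                     correction, aligned by index with `corrections`

    Hypothetical substring check, used only for illustration.
    """
    if not corrections:
        return 1.0
    retained = 0
    for (wrong, right), replies in zip(corrections, later_replies):
        text = " ".join(replies).lower()
        # Retained only if the corrected value appears and the old one does not.
        if right.lower() in text and wrong.lower() not in text:
            retained += 1
    return retained / len(corrections)


def category_range(scores_by_category):
    """Assumed definition: max minus min of the per-category mean scores."""
    means = [mean(scores) for scores in scores_by_category.values()]
    return max(means) - min(means)


# Made-up data: four corrections, only the first one is respected later.
corrections = [("Paris", "Lyon"), ("2019", "2021"),
               ("Python 2", "Python 3"), ("monthly", "weekly")]
later_replies = [["The meeting is in Lyon."], ["It launched in 2019."],
                 ["Use Python 2 here."], ["Reports go out monthly."]]
retention = correction_retention(corrections, later_replies)

# Made-up scores for two failure categories.
spread = category_range({"ignored_corrections": [0.5, 0.6],
                         "clean": [0.8, 0.9]})
```

The retention check has to read replies that come after the correction, which is why a score computed on one turn at a time has nothing to measure here.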

PyPI package: trace-score
