Hi all,
I am currently working on a paper about hallucinations in large language models and have been running several benchmarks, including WMT23, on models such as Mistral. During this process, I noticed that not all hallucinations seem to fit neatly into a single category.
Some outputs appear entirely unrelated to the input or the reference; these ungrounded cases seem hard to correct in any meaningful way. In other cases, the result looks more like a context or perspective shift: the model's answer might actually make sense if the intended context were stated more explicitly.
Especially with these context shifts, clarifying the reference or intended domain often helps, for example by specifying whether a translation should be in medical or everyday English. The ungrounded hallucinations, on the other hand, remain challenging to address even with additional context.
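To make the distinction more concrete, here is a rough sketch of the kind of check I have in mind. Everything in it is illustrative: the classify_hallucination helper is hypothetical, the surface similarity via difflib.SequenceMatcher is only a stand-in for a proper metric such as COMET or chrF, and the 0.6 threshold is an arbitrary assumption, not a calibrated value.

```python
from difflib import SequenceMatcher

# Arbitrary cutoff (assumption): below this similarity to the expected
# reference we treat the output as a hallucination candidate.
THRESHOLD = 0.6

def similarity(a: str, b: str) -> float:
    """Crude surface similarity; a real setup would use COMET, chrF, etc."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_hallucination(output: str,
                           reference: str,
                           domain_references: dict[str, str]) -> str:
    """Label an output as 'faithful', 'context_shift (<domain>)', or 'ungrounded'.

    domain_references maps a domain label (e.g. 'medical') to the reference
    that would be correct under that alternative reading of the input.
    """
    if similarity(output, reference) >= THRESHOLD:
        return "faithful"
    # If the output matches a reference from another plausible domain,
    # it looks like a context shift rather than an ungrounded hallucination.
    for domain, alt_reference in domain_references.items():
        if similarity(output, alt_reference) >= THRESHOLD:
            return f"context_shift ({domain})"
    return "ungrounded"

# Toy example: the model chose a medical register where the
# reference expected everyday English.
print(classify_hallucination(
    output="The patient presents with cephalalgia.",
    reference="I have a headache.",
    domain_references={"medical": "The patient presents with cephalalgia."},
))  # -> context_shift (medical)
```

In practice I would replace the string matching with a learned metric and derive the threshold from held-out data, but even this crude version separates the two failure modes I described above.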
I am interested to hear whether others find it useful to distinguish these cases in evaluation and benchmarking, or whether you treat both types the same. Have you had any experience with context-sensitive evaluation, or is such a distinction not very relevant for your work?
I would very much appreciate your perspectives and experiences on this topic.