Timestamps reduce Whisper hallucinations?

There are several threads where it's claimed that using return_timestamps=True in Whisper grounds the model and discourages it from hallucinating. Are there any pointers on why that helps?


I don't think it's something that's been explored formally, but empirically most people have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers' "chunked" algorithm (note that timestamps are a requirement for OpenAI's "sequential" algorithm).
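For reference, here's a minimal sketch of enabling this with the chunked algorithm; the checkpoint and audio file below are placeholders:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,  # activates Transformers' chunked long-form algorithm
)

# predicting timestamps is what empirically curbs hallucinated repetitions
result = asr("audio.mp3", return_timestamps=True)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```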

My interpretation is that forcing the model to predict timestamps works against hallucinating. Suppose you have the transcription:

The cat sat on the on the on the mat.

where "on the" is a repeated hallucination. If we ask the model to predict timestamps, then the repeated "on the" has to fit within the overall segment-level timing, e.g.:

<|0.00|> The cat sat on the on the on the mat.<|5.02|>

But it's implausible to fit three copies of "on the" within the time allotted to the segment, so the probability of this hallucinatory sequence drops, and the model instead assigns the highest probability to the correct transcription:

<|0.00|> The cat sat on the mat.<|5.02|>
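You can see these timestamp tokens directly by decoding the generated ids yourself. A rough sketch, assuming waveform is a 16 kHz mono audio array you have already loaded (placeholder name):

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(inputs.input_features, return_timestamps=True)

# decode_with_timestamps=True keeps the <|x.xx|> tokens in the output string,
# e.g. "<|0.00|> The cat sat on the mat.<|5.02|>"
print(
    processor.batch_decode(
        generated_ids, skip_special_tokens=True, decode_with_timestamps=True
    )[0]
)
```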

Interesting that there's not much formal exploration of this. The Whisper authors focus more on other heuristics for long-form transcription, such as temperature fallback (cf. Section 4.5 of the paper).
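For completeness, Transformers exposes the same fallback heuristics for sequential long-form generation. A sketch reusing model and inputs from the snippet above (argument names as in recent transformers versions, so worth checking against your installed version):

```python
generated_ids = model.generate(
    inputs.input_features,
    return_timestamps=True,  # required for sequential long-form decoding
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # retry at higher temperatures
    compression_ratio_threshold=1.35,  # retry if the output is too repetitive
    logprob_threshold=-1.0,            # retry if avg token log-prob is too low
    no_speech_threshold=0.6,           # skip segments classified as silence
)
```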

The end timestamp is in a sense the opposite of the initial timestamp constraint they describe in the paper: it helps the model remove extra words at the end of the sequence (whereas the initial timestamp constraint helps when the model skips words at the start), but the overall principle is the same: using timestamps to improve the probability of more realistic sequences.
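If you want to experiment with that initial timestamp constraint, my understanding is that recent Transformers versions surface it on Whisper's generation config:

```python
# The initial-timestamp constraint, as (I believe) exposed in Whisper's
# generation config: 50 steps of 0.02 s caps the first predicted timestamp
# at 1.0 s into the segment, discouraging the model from skipping early words
print(model.generation_config.max_initial_timestamp_index)  # e.g. 50
```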