Hi!
Whisper obtains confidence scores differently from most other transformer models.
From the Whisper paper:
Whisper relies on accurate prediction of the timestamp tokens to determine the
amount to shift the model’s 30-second audio context window by, and inaccurate transcription in one window may negatively impact transcription in the subsequent windows.
We have developed a set of heuristics that help avoid failure cases of long-form transcription, which is applied in the results reported in sections 3.8 and 3.9. First, we use beam search with 5 beams using the log probability as the score function, to reduce repetition looping which happens more frequently in greedy decoding. We start with temperature 0, i.e. always selecting the tokens with the highest probability, and increase the temperature by 0.2 up to 1.0 when either the average log probability over the generated tokens is lower than −1 or the generated text has a gzip compression rate higher than 2.4.
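The fallback heuristic from the quote can be sketched in plain Python. This is a simplified illustration, not Whisper's actual implementation: `decode_at` is a hypothetical stand-in for running the decoder at a given temperature, and the compression ratio is computed with `zlib` (raw byte length divided by compressed length), which is how repetition loops are detected — looping text compresses extremely well.

```python
import zlib

def compression_ratio(text: str) -> float:
    # Ratio of raw byte length to zlib-compressed length; highly
    # repetitive text compresses well and yields a high ratio.
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def needs_fallback(avg_logprob: float, text: str,
                   logprob_threshold: float = -1.0,
                   compression_threshold: float = 2.4) -> bool:
    # Retry at a higher temperature when the model is unconfident on
    # average OR the output looks like a repetition loop.
    return (avg_logprob < logprob_threshold
            or compression_ratio(text) > compression_threshold)

def decode_with_fallback(decode_at):
    # decode_at(t) is a hypothetical callable that decodes at temperature t
    # and returns (avg_logprob, text). Temperatures follow the paper:
    # start at 0 and step by 0.2 up to 1.0.
    for t in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        avg_logprob, text = decode_at(t)
        if not needs_fallback(avg_logprob, text):
            return t, text
    return 1.0, text  # last attempt is kept even if it still fails the checks
```

The thresholds (−1 average log probability, 2.4 compression ratio) are the ones stated in the quoted passage.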
If you look at the update method in decoding.py in the Whisper codebase, you can see that the model computes a log probability for each token and then ranks candidate sequences by their cumulative log probability. Crucially, these log probabilities are sequence-level confidences: they represent the confidence of an entire sequence of tokens rather than of individual tokens.
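A minimal sketch of what that sequence-level scoring looks like (this is an illustration of the general idea, not Whisper's code): summing per-token log probabilities gives the log probability of the whole sequence, which is what beam search ranks on; dividing by the token count gives the average log probability used by the fallback heuristic, and exponentiating that gives a rough per-token confidence in (0, 1].

```python
import math

def sequence_scores(token_logprobs: list[float]) -> tuple[float, float, float]:
    # Cumulative log prob = log probability of the whole sequence
    # (the sequence-level ranking score).
    cum_logprob = sum(token_logprobs)
    # Average log prob per token (compared against the -1 threshold).
    avg_logprob = cum_logprob / len(token_logprobs)
    # exp(avg) is a rough geometric-mean per-token confidence.
    return cum_logprob, avg_logprob, math.exp(avg_logprob)

# Ranking two candidate sequences: the one with the higher
# cumulative log probability wins, regardless of individual tokens.
beam_a = [-0.1, -0.3, -0.2]
beam_b = [-0.05, -1.5, -0.1]
best = max([beam_a, beam_b], key=lambda toks: sum(toks))
```

Note there is no per-token confidence here: a sequence can rank highly overall even if one of its tokens was individually low-probability (as in `beam_b` above).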
Hope this helps!