Get scores from Whisper using ASR pipeline

Hello, I am using Whisper to transcribe audio, and I would like to get the model's confidence for each token. I am using an AutomaticSpeechRecognitionPipeline, and I found that I can pass a dict in generate_kwargs when calling the pipeline. This dict can take a generation_config parameter, which I set to a GenerationConfig.from_pretrained('openai/whisper-medium') object. From there, I see that I can pass output_scores=True to return the scores. However, this does not change the result, and I am afraid these parameters are generic to Transformers-based models and do not apply to Whisper.

So, am I doing it wrong, or is there another way to get Whisper's output scores?

Hi!

Whisper obtains confidence scores differently from other transformer models.

From the Whisper paper:

Whisper relies on accurate prediction of the timestamp tokens to determine the amount to shift the model’s 30-second audio context window by, and inaccurate transcription in one window may negatively impact transcription in the subsequent windows.
We have developed a set of heuristics that help avoid failure cases of long-form transcription, which is applied in the results reported in sections 3.8 and 3.9. First, we use beam search with 5 beams using the log probability as the score function, to reduce repetition looping which happens more frequently in greedy decoding. We start with temperature 0, i.e. always selecting the tokens with the highest probability, and increase the temperature by 0.2 up to 1.0 when either the average log probability over the generated tokens is lower than −1 or the generated text has a gzip compression rate higher than 2.4.
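The fallback heuristic quoted above can be sketched in a few lines. This is a minimal illustration, not Whisper's actual implementation: the function name and signature are my own, but the two thresholds (average log probability below −1, gzip compression ratio above 2.4) are the ones from the paper.

```python
import gzip

# Sketch of the temperature-fallback check quoted above (hypothetical
# helper, not Whisper's internal API): decoding is retried at a higher
# temperature when either condition trips.
def needs_fallback(avg_log_prob: float, text: str,
                   logprob_threshold: float = -1.0,
                   compression_ratio_threshold: float = 2.4) -> bool:
    data = text.encode("utf-8")
    # Compression ratio = original size / compressed size. Repetition
    # loops produce highly repetitive text, which compresses very well
    # and therefore yields a high ratio.
    compression_ratio = len(data) / len(gzip.compress(data))
    return (avg_log_prob < logprob_threshold
            or compression_ratio > compression_ratio_threshold)

# Low average log probability over the generated tokens triggers a retry.
print(needs_fallback(-1.5, "a normal transcription"))  # True
# A repetition loop compresses extremely well, so the ratio check trips.
print(needs_fallback(-0.3, "the cat " * 200))          # True
print(needs_fallback(-0.3, "a normal transcription"))  # False
```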

When looking at the update method in decoding.py in the Whisper codebase, you can see that the model computes log probabilities for each token and then ranks sequences by their cumulative log probability. Crucially, these log probabilities are sequence-level confidences: they represent the confidence of an entire sequence of tokens rather than of individual tokens.
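If you do want per-token values, one generic approach is to take the step-by-step logits that generate can return and convert them into log probabilities yourself. Below is a minimal sketch using synthetic tensors standing in for the scores tuple returned by generate(..., output_scores=True, return_dict_in_generate=True) — the shapes match that API, but the values here are random rather than coming from a real Whisper run:

```python
import torch

# Stand-in for the tuple of per-step logit tensors that generate()
# returns as `scores`; one tensor of shape (batch, vocab_size) per
# generated token. Values are random for illustration only.
vocab_size = 10
torch.manual_seed(0)
scores = tuple(torch.randn(1, vocab_size) for _ in range(3))

# The token ids actually generated at each step (batch of 1).
token_ids = torch.tensor([[4, 1, 7]])

# Per-token log probability: log-softmax each step's logits, then pick
# the entry for the token that was generated at that step.
log_probs = [
    torch.log_softmax(step_logits, dim=-1)[0, tok].item()
    for step_logits, tok in zip(scores, token_ids[0])
]

# Averaging these gives the sequence-level quantity described above,
# the same one Whisper's fallback heuristic thresholds at -1.
avg_log_prob = sum(log_probs) / len(log_probs)
print(log_probs, avg_log_prob)
```

With real model output you would take `scores` and the generated ids from the dict returned by generate instead of the synthetic tensors.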

Hope this helps!
