Get scores from Whisper using ASR pipeline

Hello, I am using Whisper to transcribe audio, and I would like to get the model's confidence for each token. I am using an AutomaticSpeechRecognitionPipeline, and I found that I can pass a dict in generate_kwargs when calling the pipeline. This dict can take a generation_config parameter, which I set to a GenerationConfig.from_pretrained('openai/whisper-medium') object. From there, I see that I can pass output_scores=True to return the scores. However, this does not change the result, and I am afraid these parameters are generic to Transformers-based models and do not apply to Whisper.

So, am I doing it wrong, or is there another way to get Whisper's output scores?

Hi!

Whisper obtains confidence scores differently from other transformer models.

From the Whisper paper:

Whisper relies on accurate prediction of the timestamp tokens to determine the amount to shift the model’s 30-second audio context window by, and inaccurate transcription in one window may negatively impact transcription in the subsequent windows.
We have developed a set of heuristics that help avoid failure cases of long-form transcription, which is applied in the results reported in sections 3.8 and 3.9. First, we use beam search with 5 beams using the log probability as the score function, to reduce repetition looping which happens more frequently in greedy decoding. We start with temperature 0, i.e. always selecting the tokens with the highest probability, and increase the temperature by 0.2 up to 1.0 when either the average log probability over the generated tokens is lower than −1 or the generated text has a gzip compression rate higher than 2.4.
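The fallback heuristic quoted above can be sketched in a few lines. This is a minimal illustration, not Whisper's actual implementation: the function name and signature are my own, but the two thresholds (average log probability below −1, gzip compression ratio above 2.4) are the ones from the paper.

```python
import gzip

# Sketch of the temperature-fallback check quoted above (hypothetical
# helper, not Whisper's internal API): decoding is retried at a higher
# temperature when either condition trips.
def needs_fallback(avg_log_prob: float, text: str,
                   logprob_threshold: float = -1.0,
                   compression_ratio_threshold: float = 2.4) -> bool:
    data = text.encode("utf-8")
    # Compression ratio = original size / compressed size. Repetition
    # loops produce highly repetitive text, which compresses very well
    # and therefore yields a high ratio.
    compression_ratio = len(data) / len(gzip.compress(data))
    return (avg_log_prob < logprob_threshold
            or compression_ratio > compression_ratio_threshold)

# Low average log probability over the generated tokens triggers a retry.
print(needs_fallback(-1.5, "a normal transcription"))  # True
# A repetition loop compresses extremely well, so the ratio check trips.
print(needs_fallback(-0.3, "the cat " * 200))          # True
print(needs_fallback(-0.3, "a normal transcription"))  # False
```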

When looking at the update method in decoding.py in the Whisper codebase, you can see that the model computes log probabilities for each token and then ranks sequences by their cumulative log probability. Crucially, these log probabilities are sequence-level confidences: they represent the confidence of an entire sequence of tokens rather than of individual tokens.
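If you do want per-token values, one generic approach is to take the step-by-step logits that generate can return and convert them into log probabilities yourself. Below is a minimal sketch using synthetic tensors standing in for the scores tuple returned by generate(..., output_scores=True, return_dict_in_generate=True) — the shapes match that API, but the values here are random rather than coming from a real Whisper run:

```python
import torch

# Stand-in for the tuple of per-step logit tensors that generate()
# returns as `scores`; one tensor of shape (batch, vocab_size) per
# generated token. Values are random for illustration only.
vocab_size = 10
torch.manual_seed(0)
scores = tuple(torch.randn(1, vocab_size) for _ in range(3))

# The token ids actually generated at each step (batch of 1).
token_ids = torch.tensor([[4, 1, 7]])

# Per-token log probability: log-softmax each step's logits, then pick
# the entry for the token that was generated at that step.
log_probs = [
    torch.log_softmax(step_logits, dim=-1)[0, tok].item()
    for step_logits, tok in zip(scores, token_ids[0])
]

# Averaging these gives the sequence-level quantity described above,
# the same one Whisper's fallback heuristic thresholds at -1.
avg_log_prob = sum(log_probs) / len(log_probs)
print(log_probs, avg_log_prob)
```

With real model output you would take `scores` and the generated ids from the dict returned by generate instead of the synthetic tensors.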

Hope this helps!
