Whisper decoder is slow for ASR task

I have followed this blog to finetune the ASR model.
The training is working fine. However, the decoding time is very slow.

Are there hyperparameters to be optimized for speeding up the decoder of Whisper?
Or is there a possibility to customize the decoder of Whisper?


Hey @ksoky!

Seq2Seq models generate text autoregressively with the decoder (see Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers for details). So we perform one forward pass of the decoder for every token generated.
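To make that cost concrete, here is a minimal toy sketch (not Whisper's actual implementation) of autoregressive greedy decoding: every new token requires one more decoder forward pass, so generation time grows linearly with output length. `toy_decoder_step` is a hypothetical stand-in for the real decoder.

```python
def toy_decoder_step(tokens, vocab_size=10):
    # Hypothetical stand-in for one decoder forward pass:
    # returns a deterministic pseudo-logit for each vocabulary entry.
    return [((sum(tokens) + 1) * (i + 3)) % vocab_size for i in range(vocab_size)]

def greedy_decode(max_new_tokens, bos_id=1, eos_id=0):
    tokens = [bos_id]
    forward_passes = 0
    for _ in range(max_new_tokens):
        logits = toy_decoder_step(tokens)          # one decoder pass...
        forward_passes += 1                        # ...per generated token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_id:                      # stop at end-of-sequence
            break
    return tokens, forward_passes
```

Generation stops either at the end-of-sequence token or at `max_new_tokens`, whichever comes first; in either case the number of decoder passes equals the number of tokens produced.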

Running generation in “greedy” mode will be much faster than beam search (greedy is the default).
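The reason greedy is faster can be seen in a toy beam-search sketch (again not the 🤗 Transformers implementation, and omitting details like length penalties and per-beam EOS handling): every step runs the decoder once per live beam, so beam search costs roughly `num_beams` times more decoder passes than greedy decoding.

```python
import math

def toy_decoder_step(tokens, vocab_size=5):
    # Hypothetical stand-in for one decoder forward pass:
    # deterministic pseudo log-probabilities over a toy vocabulary.
    scores = [((sum(tokens) + 1) * (i + 2)) % 7 + 1 for i in range(vocab_size)]
    total = sum(scores)
    return [math.log(s / total) for s in scores]

def beam_search(steps, num_beams=3, bos_id=1):
    beams = [([bos_id], 0.0)]  # (token sequence, cumulative log-prob)
    forward_passes = 0
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            log_probs = toy_decoder_step(seq)   # one decoder pass per live beam
            forward_passes += 1
            for tok, lp in enumerate(log_probs):
                candidates.append((seq + [tok], score + lp))
        # keep only the num_beams best continuations
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0], forward_passes
```

In this toy, 10 steps with `num_beams=3` perform 28 decoder passes (1 for the initial beam, then 3 per step), versus 10 passes with `num_beams=1`, i.e. greedy.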

You can also explore reducing the “max_length”:

model.config.max_length = 100

This caps generation at 100 tokens. But it will almost certainly reduce your overall performance, as some sentences will be truncated short.

Are you running inference on GPU? It shouldn’t be too slow with the “small” checkpoint on most GPU devices!

Alternatively, you can try training one of the smaller checkpoints (“base” or “tiny”) for faster inference.

Dear @sanchit-gandhi,

Thanks for your suggestions.
I will try again and come back soon.

Best,

How did it turn out? I ran into the same issue: my fine-tuned model feels like it takes 3-6x longer to predict than whisper medium.en did. I need rapid, near-live transcriptions.
This is my code so far:

from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
    pipeline,
)

path_to_model = 'path_to_model'

model = WhisperForConditionalGeneration.from_pretrained(path_to_model)
model.config.max_length = 150  # cap generation length

tokenizer = WhisperTokenizer.from_pretrained(path_to_model)
feature_extractor = WhisperFeatureExtractor.from_pretrained(path_to_model)

pipe = pipeline(
    task='automatic-speech-recognition',
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    chunk_length_s=15,
    generate_kwargs={"num_beams": 1},  # greedy decoding (num_beams is a
    # generate-time argument, not a from_pretrained argument)
    # device=0,  # uncomment to run inference on the first GPU
)

def transcribe(audio):
    return pipe(audio)["text"]

file_path = 'path_to_audio'  # placeholder: path to an audio file
transcription = transcribe(file_path)