Whisper: padding issues while transcribing

hassanshuman · January 1, 2025, 4:37pm

Hi!

I am fine-tuning Whisper Small, and judging by the loss curves and the validation during training, everything looks like it is working!

However, when I test my trained model to transcribe some audio, it (almost perfectly) transcribes the audio and then adds some “nonsense” to the end of it. My suspicion is that this is because I set

processor.tokenizer.pad_token = processor.tokenizer.eos_token
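To make that suspicion concrete, here is a minimal sketch of one way the assignment above could hurt training. It assumes the training collator masks label padding by comparing ids against `pad_token_id`; the training code is not shown in this post, so that is an assumption. The token ids and the helper names `mask_by_value` and `mask_by_length` are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

# Hypothetical token ids, for illustration only (not real Whisper ids)
EOS_ID = 2
PAD_ID = EOS_ID  # mirrors pad_token = eos_token


def mask_by_value(labels, pad_id):
    """Mask every position whose id equals pad_id.

    When pad_id == eos_id this also hides the real EOS from the loss.
    """
    return [IGNORE_INDEX if token == pad_id else token for token in labels]


def mask_by_length(labels, real_length):
    """Mask only positions past the unpadded length, so the real EOS stays a target."""
    return [token if i < real_length else IGNORE_INDEX for i, token in enumerate(labels)]


# Three text tokens, one real EOS, then two padding positions
padded = [11, 12, 13, EOS_ID, PAD_ID, PAD_ID]
real_length = 4

by_value = mask_by_value(padded, PAD_ID)
by_length = mask_by_length(padded, real_length)

print(by_value)   # the EOS target is gone
print(by_length)  # the EOS target survives, only padding is ignored
```

If the labels were masked the first way, the model would never be trained to emit the end-of-text token, which would be consistent with output that keeps going after the real transcription. Masking by the label attention mask (the second variant) avoids that.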

Interestingly, the length of the “nonsense” is semi-consistent no matter the length of the audio.

Here’s how I tested my trained model on a single mp3.

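from io import BytesIO

import librosa
import torch
from pydub import AudioSegment
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# fine_tuned_model_path and audio_path are defined elsewhere
processor = WhisperProcessor.from_pretrained(fine_tuned_model_path)
model = WhisperForConditionalGeneration.from_pretrained(fine_tuned_model_path)

# Convert the mp3 to an in-memory wav, then load it as 16 kHz mono audio
audio_segment = AudioSegment.from_mp3(audio_path)
wav_io = BytesIO()
audio_segment.export(wav_io, format="wav")
wav_io.seek(0)
audio, _ = librosa.load(wav_io, sr=16000)

# Compute log-mel input features, padded to the maximum input length
inputs = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
)

print(inputs)
print(inputs.input_features)

# Generate token ids without tracking gradients
with torch.no_grad():
    generated_ids = model.generate(
        inputs.input_features,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)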

Example of the output for a 27-second mp3:
> hypertension, i.e. hypertension caused by an underlying condition, Table 4. Information on behaviour, including tobacco use, physical activity, and dietary intake of fat, salt and alcohol. Personal, psychosocial, occupational and environmental factors that could infl uence the course and outcome of long-term care. Physical examination – a full physical...................................................................................................., physicalé- a

And with skip_special_tokens = False:

Transcription: <|startoftranscript|><|en|><|transcribe|><|notimestamps|>hypertension, i.e. hypertension caused by an underlying condition, Table 4. Information on behaviour, including tobacco use, physical activity, and dietary intake of fat, salt and alcohol. Personal, psychosocial, occupational and environmental factors that could infl uence the course and outcome of long-term care. Physical examination – a full physical…, physicalé-<|pt|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|id|><|notimestamps|><|id|> a<|id|><|id|><|id|> [<|id|> repeated several hundred more times, up to the end of the output]

Have I made any obvious mistakes? I really want Whisper to essentially make no predictions on padded audio.
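One diagnostic that might help narrow this down is checking whether the generated ids ever contain the end-of-text token after the prompt tokens. The sketch below works on plain lists of ids; the helper names `find_first_eos` and `trim_at_eos` and the example ids are hypothetical, not part of any library.

```python
def find_first_eos(token_ids, eos_id, prompt_length=0):
    """Return the index of the first eos_id at or after prompt_length, or None if absent."""
    for index in range(prompt_length, len(token_ids)):
        if token_ids[index] == eos_id:
            return index
    return None


def trim_at_eos(token_ids, eos_id, prompt_length=0):
    """Drop everything from the first EOS onward; return a copy unchanged if there is no EOS."""
    index = find_first_eos(token_ids, eos_id, prompt_length)
    return list(token_ids) if index is None else list(token_ids[:index])


# Hypothetical ids: two prompt tokens, then text tokens, then (maybe) EOS and trailing junk
EOS_ID = 2
ids_with_eos = [90, 91, 11, 12, EOS_ID, 40, 40]
ids_without_eos = [90, 91, 11, 12, 40, 40, 40]

print(find_first_eos(ids_with_eos, EOS_ID, prompt_length=2))
print(find_first_eos(ids_without_eos, EOS_ID, prompt_length=2))
print(trim_at_eos(ids_with_eos, EOS_ID, prompt_length=2))
```

With the real model this could be called on `generated_ids[0].tolist()` with `processor.tokenizer.eos_token_id`. If no EOS is found, as the output with skip_special_tokens = False suggests, the problem is more likely that the model never learned to stop than anything about padding at inference time.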
