Whisper: padding issues while transcribing

Hi!

I am fine-tuning Whisper Small, and judging by the loss curves and the validation results during training, everything appears to be working!

However, when I use the trained model to transcribe audio, it transcribes the audio (almost perfectly) and then appends some “nonsense” to the end. My suspicion is that this is because I set

processor.tokenizer.pad_token = processor.tokenizer.eos_token

Interestingly, the length of the “nonsense” is semi-consistent no matter the length of the audio.
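
To make the suspicion concrete, here is a minimal sketch of one way that setting could matter. It assumes a training collator that masks padded label positions by comparing against pad_token_id; the collator is not shown in this post, so that is only an assumption. The token ids and the helper names mask_by_token_id and mask_by_attention are hypothetical, chosen for illustration. If pad and eos share one id, masking by id also hides the real end-of-text label, so the model might never be trained to stop.

```python
# Label value that PyTorch's cross-entropy loss skips by default.
IGNORE_INDEX = -100

# Hypothetical ids for illustration only (not Whisper's real vocabulary).
EOS_ID = 2
PAD_ID = EOS_ID  # mirrors pad_token = eos_token


def mask_by_token_id(labels, pad_id):
    """Mask every position whose token equals pad_id (also hits a real EOS)."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in labels]


def mask_by_attention(labels, attention_mask):
    """Mask only positions the tokenizer marked as padding (mask == 0)."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(labels, attention_mask)]


# Three text tokens, the real EOS, then two padding positions.
labels = [10, 11, 12, EOS_ID, PAD_ID, PAD_ID]
attention_mask = [1, 1, 1, 1, 0, 0]

by_id = mask_by_token_id(labels, PAD_ID)
by_mask = mask_by_attention(labels, attention_mask)

print(by_id)    # the EOS label is masked along with the padding
print(by_mask)  # the EOS label survives; only padding is masked
```

If the collator masks by attention mask instead, the shared id should not hide the end-of-text label, which would point the search elsewhere.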

Here’s how I tested my trained model on a single mp3:

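from io import BytesIO

import librosa
import torch
from pydub import AudioSegment
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# fine_tuned_model_path and audio_path are defined elsewhere
processor = WhisperProcessor.from_pretrained(fine_tuned_model_path)
model = WhisperForConditionalGeneration.from_pretrained(fine_tuned_model_path)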

# Convert the mp3 to an in-memory wav, then load it as 16 kHz mono audio
audio_segment = AudioSegment.from_mp3(audio_path)
wav_io = BytesIO()
audio_segment.export(wav_io, format="wav")
wav_io.seek(0)
audio, _ = librosa.load(wav_io, sr=16000)

inputs = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
)

print(inputs)

print(inputs.input_features)
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features, pad_token_id=processor.tokenizer.pad_token_id)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)

Example of the output for a 27-second mp3:
> hypertension, i.e. hypertension caused by an underlying condition, Table 4. Information on behaviour, including tobacco use, physical activity, and dietary intake of fat, salt and alcohol. Personal, psychosocial, occupational and environmental factors that could infl uence the course and outcome of long-term care. Physical examination – a full physical...................................................................................................., physicalé- a

And with skip_special_tokens = False:

Transcription: <|startoftranscript|><|en|><|transcribe|><|notimestamps|>hypertension, i.e. hypertension caused by an underlying condition, Table 4. Information on behaviour, including tobacco use, physical activity, and dietary intake of fat, salt and alcohol. Personal, psychosocial, occupational and environmental factors that could infl uence the course and outcome of long-term care. Physical examination – a full physical…, physicalé-<|pt|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|notimestamps|><|id|><|id|><|notimestamps|><|id|> a<|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|
><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|><|id|>

Have I made any obvious mistakes? I really want Whisper to essentially make no predictions on padded audio.
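
As a diagnostic, a sketch like the one below could check whether an end-of-text token is ever generated and cut the sequence there before decoding. It works on plain lists of token ids; the helper names first_eos_index and truncate_at_eos and the ids in the example are hypothetical. With real output, the ids would presumably come from generated_ids[0].tolist() and the end-of-text id from processor.tokenizer.eos_token_id. The decoded output above shows no <|endoftext|> at all, so on that sample the check would likely report that the model never stopped on its own.

```python
def first_eos_index(token_ids, eos_id, prompt_len=0):
    """Return the index of the first eos_id at or after prompt_len, else None."""
    for i in range(prompt_len, len(token_ids)):
        if token_ids[i] == eos_id:
            return i
    return None


def truncate_at_eos(token_ids, eos_id, prompt_len=0):
    """Drop the first EOS and everything after it; keep all ids if none is found."""
    idx = first_eos_index(token_ids, eos_id, prompt_len)
    return list(token_ids) if idx is None else list(token_ids[:idx])


# Hypothetical ids: 4 prompt tokens, 3 text tokens, EOS (id 2), trailing junk.
EXAMPLE_EOS = 2
with_eos = [90, 91, 92, 93, 10, 11, 12, EXAMPLE_EOS, 55, 55, 55]
without_eos = [90, 91, 92, 93, 10, 11, 12, 55, 55, 55]

print(first_eos_index(with_eos, EXAMPLE_EOS, prompt_len=4))
print(truncate_at_eos(with_eos, EXAMPLE_EOS, prompt_len=4))
print(first_eos_index(without_eos, EXAMPLE_EOS, prompt_len=4))
```

This only trims the symptom after the fact; if no end-of-text token ever appears, the cause is more likely in how the labels were built during training.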
