Inverse normalising entities in Whisper

How can we normalize or inverse normalize certain entities when using Whisper?

  • In its pre-trained form, Whisper is biased against normalised entities (e.g. EAPs would probably be transcribed as something like ear and peas)
  • If you have certain entities that you expect to be normalised/un-normalised at inference time, fine-tuning Whisper on labelled data with these entities will certainly improve its performance on this distribution of data
  • The amount of data you’d need for this domain shift is low: you can fine-tune Whisper with as little as 5-10 hours of labelled audio data and significantly improve its performance on your target domain
  • There is a risk of ‘catastrophic forgetting’ here: Whisper quickly overfits on this fine-tuning set and ‘forgets’ how to generalise, but if you only care about how Whisper performs on data at deployment time this is fine → you just need to make sure your fine-tuning data is in-domain with data at deployment

You can quite feasibly fine-tune Whisper small/medium on a single V100, and Whisper medium/large on a single A100. See community-events/whisper-fine-tuning-event at main · huggingface/community-events · GitHub.

Which checkpoint you choose (tiny/base/small/medium/large) is a trade-off between:

  • Performance
  • Inference speed

Fine-tuning greatly reduces the performance gap between checkpoints. E.g. fine-tuning the small checkpoint on 5h of audio data will give you better performance than the pre-trained medium checkpoint, but will run 3x faster at inference time. My recommendation would be to fine-tune the small/medium checkpoints for this reason.

Here’s a blog post which explains fine-tuning from start to finish: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.

Supposing we don’t modify the Whisper checkpoint in any way, inverse normalising entities is straightforward: it’s a dictionary mapping from the normalised entity (EAP) to the un-normalised entity (expert acceleration programme), so we can just build a dict with the mappings accordingly:

MAP_TO_WORD = {"EAP": "expert acceleration programme", ...}
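A minimal sketch of applying such a mapping to a transcript, assuming the entities appear as whole words in Whisper's output. The helper name `inverse_normalise` and the sample sentence are illustrative only, and the dict holds just the one entry given above:

```python
import re

# Mapping from the normalised entity to its un-normalised form.
# Only the entry from the post is included; extend it with your own entities.
MAP_TO_WORD = {"EAP": "expert acceleration programme"}

# One pattern matching any key as a whole word. Longer keys come first so
# that overlapping entities resolve to the longest match.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(MAP_TO_WORD, key=len, reverse=True))
    + r")\b"
)


def inverse_normalise(text: str) -> str:
    """Replace each normalised entity in `text` with its un-normalised form."""
    return _PATTERN.sub(lambda m: MAP_TO_WORD[m.group(1)], text)


print(inverse_normalise("We joined the EAP last year."))
# We joined the expert acceleration programme last year.
```

The word-boundary match is case-sensitive and deliberately leaves forms such as "EAPs" untouched; plurals or lower-case variants would need their own entries.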

Normalising entities is trickier because of Whisper’s tendency not to transcribe entities correctly (EAP/ears and peas), making it harder to build a mapping. Here, we’d need to know more about the kinds of entities to better understand what we could do to normalise.
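If the mis-transcriptions turn out to be fairly consistent, one possible approach is to collect the variants you actually observe and map them back to the entity. The sketch below assumes exactly that: the `VARIANTS` table and the `normalise` helper are hypothetical, seeded only with the forms mentioned in this post, and would need to be built from real Whisper output:

```python
import re

# Hypothetical table: entity -> surface forms seen in transcripts.
# Seeded with the forms mentioned in this post; build yours from real output.
VARIANTS = {
    "EAP": ["expert acceleration programme", "ears and peas", "ear and peas"],
}

# Flatten to variant -> entity, lower-cased for case-insensitive lookup.
_LOOKUP = {v.lower(): entity for entity, vs in VARIANTS.items() for v in vs}

# Longest variants first so a longer phrase wins over a shorter one it contains.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(_LOOKUP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def normalise(text: str) -> str:
    """Replace every known variant in `text` with its normalised entity."""
    return _PATTERN.sub(lambda m: _LOOKUP[m.group(1).lower()], text)


print(normalise("She applied to the ears and peas in March."))
# She applied to the EAP in March.
```

A lookup like this will also rewrite a phrase that was meant literally, so it only makes sense for variants that are unlikely to occur in your domain by chance.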

Hi @sanchit-gandhi,

I am fine-tuning whisper-large-v3 to always predict normalized text with respect to numbers, and I have around 50 hours of English data. I was able to get good results by fine-tuning whisper-medium, but the large model behaves differently. Initially the WER decreases gradually along with the validation and training loss, but when training is about to converge, the losses remain low while the WER shoots up. The model generates the first few tokens accurately and then repeats the same token again and again.

With a learning rate of 1e-5, the model reaches the convergence point at around 1500 steps before training goes bad. With a learning rate of 1e-6, the same happens after 8000 steps.

Any ideas on how I can improve fine-tuning of the model?
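To make the symptom concrete: WER is scored on generated text, so a hypothesis that loops on one token picks up many insertion errors and can push WER above 1. The sketch below is a plain word-level WER plus a simple repetition check that could be used to flag such outputs during evaluation. Both function names and the sample strings are illustrative assumptions, not part of any library, and the WER function assumes a non-empty reference:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1] / len(ref)


def longest_repeat_run(text: str) -> int:
    """Length of the longest run of identical consecutive words."""
    longest = run = 0
    prev = None
    for word in text.split():
        run = run + 1 if word == prev else 1
        longest = max(longest, run)
        prev = word
    return longest


reference = "the total is 25 dollars"
hypothesis = "the total is 25 25 25 25 25 25 25"  # made-up looping output

print(wer(reference, hypothesis))      # 1.2
print(longest_repeat_run(hypothesis))  # 7
```

Logging a check like `longest_repeat_run` alongside WER at each evaluation step would show at which checkpoint the looping starts.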