Inverse normalising entities in Whisper

How can we normalize or inverse normalize certain entities when using Whisper?

  • In its pre-trained form, Whisper is biased against normalised entities (e.g. EAPs would probably be transcribed as something like ear and peas)
  • If you have certain entities that you expect to be normalised/un-normalised at inference time, fine-tuning Whisper on labelled data with these entities will certainly improve its performance on this distribution of data
  • The amount of data you’d need for this domain shift is low: you can fine-tune Whisper with as little as 5-10 hours of labelled audio data and significantly improve its performance on your target domain
  • There is a risk of ‘catastrophic forgetting’ here: Whisper quickly overfits on this fine-tuning set and ‘forgets’ how to generalise, but if you only care about how Whisper performs on data at deployment time this is fine → you just need to make sure your fine-tuning data is in-domain with data at deployment

You can quite feasibly fine-tune Whisper small/medium on a single V100, and Whisper medium/large on a single A100. See the whisper-fine-tuning-event directory of the huggingface/community-events repository on GitHub for details. Which checkpoint you choose (tiny/base/small/medium/large) is a trade-off between:

  • Performance
  • Inference speed

Fine-tuning greatly reduces the performance gap between checkpoints. E.g. fine-tuning the small checkpoint on 5h of audio data will give you better performance than the pre-trained medium checkpoint, and it will run 3x faster at inference time. My recommendation would be to fine-tune the small/medium checkpoints for this reason.

Here’s a blog post which explains fine-tuning from start to finish: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.

Supposing we don’t modify the Whisper checkpoint in any way, inverse normalising entities is straightforward: it’s a dictionary mapping from the normalised entity (EAP) to the un-normalised entity (expert acceleration programme), so we can just build a dict with the mappings accordingly:

MAP_TO_WORD = {"EAP": "expert acceleration programme", ...}
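To show how such a dict could be applied to a transcription, here is a minimal sketch. The function name inverse_normalise and the word-boundary matching are my own assumptions, not part of Whisper or Transformers; only the EAP entry comes from the example above.

```python
import re

# Only the EAP entry comes from the post; extend with your own entities.
MAP_TO_WORD = {"EAP": "expert acceleration programme"}


def inverse_normalise(text, mapping=MAP_TO_WORD):
    """Replace each normalised entity in `text` with its spelled-out form.

    Hypothetical helper: matching is case-sensitive and on whole words only.
    """
    if not mapping:
        return text
    # Try longer keys first so overlapping entities prefer the longer match.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(inverse_normalise("I signed up for the EAP last week."))
```

Because it only matches whole words, a plural such as EAPs is left untouched unless you add it to the dict as its own entry.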

Normalising entities is trickier because of Whisper’s tendency not to transcribe entities correctly (EAP/ears and peas), making it harder to build a mapping. Here, we’d need to know more about the kinds of entities to better understand what we could do to normalise.
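One possible starting point, if you can collect the mis-transcriptions Whisper actually produces for your entities, is a reverse lookup from observed variants to the normalised form. This is only a sketch under that assumption: the VARIANTS_TO_ENTITY table and the normalise function are hypothetical, the variants listed are just the ones mentioned above, and it will miss any variant you haven’t seen before.

```python
import re

# Hypothetical table of variants seen in transcriptions (keys must be lowercase).
VARIANTS_TO_ENTITY = {
    "ear and peas": "EAPs",
    "ears and peas": "EAPs",
    "expert acceleration programme": "EAP",
}


def normalise(text, variants=VARIANTS_TO_ENTITY):
    """Replace known variants in `text` with the normalised entity.

    Matching ignores case and only applies to whole words/phrases.
    """
    if not variants:
        return text
    # Try longer phrases first so the most specific variant wins.
    keys = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    # Lowercase the match before the lookup, since keys are stored lowercase.
    return pattern.sub(lambda m: variants[m.group(0).lower()], text)


print(normalise("We offer Ear and Peas to our customers."))
```

A fixed table like this can also rewrite phrases that were genuinely meant literally, so it is worth checking the results on real transcriptions before relying on it.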

