How can we normalize or inverse normalize certain entities when using Whisper?
- In its pre-trained form, Whisper is biased against normalised entities (e.g. ‘EAPs’ would probably be transcribed as something like ‘ear and peas’)
- If you have certain entities that you expect to be normalised/un-normalised at inference time, fine-tuning Whisper on labelled data with these entities will certainly improve its performance on this distribution of data
- The amount of data you’d need for this domain shift is low: you can fine-tune Whisper with as little as 5-10 hours of labelled audio data and significantly improve its performance on your target domain
- There is a risk of ‘catastrophic forgetting’ here: Whisper quickly overfits on this fine-tuning set and ‘forgets’ how to generalise, but if you only care about how Whisper performs on data at deployment time this is fine → you just need to make sure your fine-tuning data is in-domain with data at deployment
You can quite feasibly fine-tune Whisper small/medium on a single V100, and Whisper medium/large on a single A100. See community-events/whisper-fine-tuning-event at main · huggingface/community-events · GitHub.
Which checkpoint you use (tiny/base/small/medium/large) is a trade-off between:
- Performance
- Inference speed
Fine-tuning greatly reduces the performance gap between checkpoints. E.g. fine-tuning the small checkpoint on 5h of audio data will give you better performance than the pre-trained medium checkpoint, but will run 3x faster at inference time. My recommendation would be to fine-tune the small/medium checkpoints for this reason.
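To compare checkpoints on your own data, you need a metric computed on a held-out, in-domain set. The post doesn't name one, so the sketch below assumes word error rate (WER), a common choice for ASR. The function name and the example sentences are illustrative only:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the
    number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution plus two insertions against a 5-word reference.
print(word_error_rate("we joined the EAP today",
                      "we joined the ears and peas today"))  # 0.6
```

No casing or punctuation normalisation is applied before scoring here. That is deliberate, since the casing of an entity like EAP is part of what you want to measure, but you may prefer to normalise the text first.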
Here’s a blog post which explains fine-tuning from start to finish: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.
Supposing we don’t modify the Whisper checkpoint in any way, inverse normalising entities is straightforward: it’s a dictionary mapping from the normalised entity (‘EAP’) to the un-normalised entity (‘expert acceleration programme’), so we can just build a dict with the mappings accordingly:
MAP_TO_WORD = {"EAP": "expert acceleration programme", ...}
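As a minimal sketch of how such a mapping could be applied to a transcript: the `inverse_normalise` helper below is a hypothetical name, and matching on word boundaries is my assumption rather than anything Whisper or Transformers provides.

```python
import re

# Normalised entity -> un-normalised form. Only the "EAP" entry comes from
# the example above; extend the dict with your own entities.
MAP_TO_WORD = {"EAP": "expert acceleration programme"}


def inverse_normalise(text, mapping=MAP_TO_WORD):
    """Replace every normalised entity found in `text` with its un-normalised form."""
    if not mapping:
        return text
    # Try longer keys first so a long entity is not shadowed by a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    # \b keeps "EAP" from matching inside a longer word such as "LEAP".
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


print(inverse_normalise("I joined the EAP last year."))
# I joined the expert acceleration programme last year.
```

Matching is case-sensitive and exact, so a plural such as "EAPs" is left untouched unless you add it as its own key.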
Normalising entities is trickier because of Whisper’s tendency not to transcribe entities correctly (‘EAP’/‘ears and peas’), making it harder to build a mapping. Here, we’d need to know more about the kinds of entities to better understand what we could do to normalise them.
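If you can collect the phrases Whisper actually produces for each entity, one possible starting point is a reverse lookup over those observed variants. This is only a sketch under that assumption: `VARIANT_TO_ENTITY` and `normalise` are hypothetical names, and the table would need to be filled from your own transcripts.

```python
import re

# Hypothetical table: phrase as it appears in a transcript -> normalised entity.
VARIANT_TO_ENTITY = {
    "expert acceleration programme": "EAP",  # the spoken form, transcribed correctly
    "ears and peas": "EAP",                  # the mis-transcription mentioned above
}


def normalise(text, variants=VARIANT_TO_ENTITY):
    """Replace every known variant in `text` with its normalised entity."""
    if not variants:
        return text
    # Lower-case and collapse whitespace in the keys so lookups are uniform.
    lookup = {" ".join(k.lower().split()): v for k, v in variants.items()}
    keys = sorted(lookup, key=len, reverse=True)
    # Allow any run of whitespace between the words of a variant.
    alternatives = [r"\s+".join(re.escape(w) for w in k.split()) for k in keys]
    pattern = re.compile(r"\b(" + "|".join(alternatives) + r")\b", re.IGNORECASE)

    def replace(match):
        key = " ".join(match.group(1).lower().split())
        return lookup[key]

    return pattern.sub(replace, text)


print(normalise("She leads the Ears and Peas team."))
# She leads the EAP team.
```

This only catches variants you have already seen. Anything Whisper transcribes differently will slip through, which is why fine-tuning on in-domain data is usually the more robust route.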