Adding custom vocabularies on Whisper

andregn · January 12, 2024, 8:52pm

Here are 2 other approaches.
Neither requires training, so I highly recommend trying them before fine-tuning models or changing their architecture.

1. Initial Prompt

You can simply use the parameter initial_prompt to create a bias towards your vocabulary.
In your example, you could write: "Let's talk about International Monetary Fund and SDRs."
This will encourage the model to repeat the term SDRs and other terms related to finances.
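
If you have several terms, a minimal sketch like the one below can build the prompt for you. The helper names `build_initial_prompt` and `transcribe_with_vocabulary` are made up for illustration, and the second one assumes the openai-whisper package, whose `transcribe` accepts `initial_prompt`. As far as I know Whisper only keeps a limited window of prompt tokens, so a short list of your most important terms is probably better than a long one.

```python
def build_initial_prompt(terms):
    """Join domain terms into one natural-sounding sentence for initial_prompt."""
    cleaned = [t.strip() for t in terms if t.strip()]
    if not cleaned:
        return ""
    if len(cleaned) == 1:
        listing = cleaned[0]
    else:
        listing = ", ".join(cleaned[:-1]) + " and " + cleaned[-1]
    return f"Let's talk about {listing}."


def transcribe_with_vocabulary(audio_path, terms, model_name="base"):
    """Hypothetical wrapper; assumes the openai-whisper package is installed."""
    import whisper  # imported lazily so the prompt helper works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, initial_prompt=build_initial_prompt(terms))
    return result["text"]


print(build_initial_prompt(["International Monetary Fund", "SDRs"]))
# Let's talk about International Monetary Fund and SDRs.
```

Writing the prompt as a normal sentence, rather than a bare word list, keeps it close to the kind of text the model expects to see as previous context.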

or…

2. Suppress Tokens

Sometimes Whisper keeps using the wrong word. If that’s the case, you can suppress that token.

For example, let’s pretend there’s a Latin name “Esthear” and whisper transcribes to “I hold access to Esthear’s…”
Pretend this name is represented by tokens:
("Esthe", "ar") → (98765, 12345)

If you suppress the token “Esthe”, Whisper will need to come up with alternatives to transcribe your audio, and will hopefully guess “SDR” correctly.
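
To see why suppression forces an alternative, here is a toy illustration. It is not Whisper's actual code, and the three-token vocabulary and scores are made up. The idea is that a suppressed token's score is set to negative infinity before the next token is chosen, so the decoder can never pick it and falls back to the next-best candidate.

```python
import math


def suppress(logits, suppress_ids):
    """Return a copy of the scores with suppressed token IDs set to -inf."""
    banned = set(suppress_ids)
    return [-math.inf if i in banned else score for i, score in enumerate(logits)]


def greedy_pick(logits):
    """Pick the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


vocab = ["Esthe", "SD", "est"]  # made-up vocabulary, IDs 0..2
logits = [2.5, 2.1, 0.3]        # made-up scores for one decoding step

print(vocab[greedy_pick(logits)])                 # Esthe
print(vocab[greedy_pick(suppress(logits, [0]))])  # SD
```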

But be careful not to suppress common tokens. If you suppress both tokens “Esthe” + “ar”, it might impact other words, like “mo-net-ar-y” or “tow-ar-ds”.
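
One way to check for this before suppressing anything is to tokenize a few words you care about and see whether they use the same IDs. The sketch below is only an illustration: `find_collisions` is a made-up helper, and `fake_encode` is a stand-in tokenizer built from the pretend IDs above so the example runs anywhere. With the openai-whisper package you could pass `whisper.tokenizer.get_tokenizer(multilingual=True).encode` instead (I am assuming a multilingual model here).

```python
def find_collisions(encode, suppressed_ids, words):
    """Return {word: [suppressed IDs it uses]} for words that would be affected."""
    banned = set(suppressed_ids)
    collisions = {}
    for word in words:
        # Check both forms: BPE tokenizers usually encode " word" and "word" differently.
        ids = set(encode(word)) | set(encode(" " + word))
        hit = sorted(ids & banned)
        if hit:
            collisions[word] = hit
    return collisions


# Stand-in tokenizer (made-up splits and IDs, following the pretend example above).
FAKE_VOCAB = {"Esthe": 98765, "ar": 12345, "mo": 1, "net": 2, "y": 3,
              "tow": 4, "ds": 5, "SD": 6, "R": 7}
FAKE_SPLITS = {
    "Esthear": ["Esthe", "ar"],
    "monetary": ["mo", "net", "ar", "y"],
    "towards": ["tow", "ar", "ds"],
    "SDR": ["SD", "R"],
}


def fake_encode(text):
    """Look up the made-up split for a word and return its token IDs."""
    return [FAKE_VOCAB[piece] for piece in FAKE_SPLITS.get(text.strip(), [])]


words = ["monetary", "towards", "SDR"]
print(find_collisions(fake_encode, [98765, 12345], words))
# {'monetary': [12345], 'towards': [12345]}
print(find_collisions(fake_encode, [98765], words))
# {}
```

An empty result for your word list suggests the suppression is narrow enough. It is not a guarantee, since it only covers the words you thought to test.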

Code Example

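# Assumes `import whisper`, `model = whisper.load_model(...)`, and `audio_file` set to your audio path.
# 98765 is the pretend ID for "Esthe" from above; replace it with the real token ID.
initial_prompt = "Let's talk about International Monetary Fund and SDRs."
result = model.transcribe(audio_file, initial_prompt=initial_prompt, suppress_tokens=[98765])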
