Adding custom vocabularies on Whisper

sebasarango1180 · January 10, 2023, 5:04pm

Hi,

I’ve been running some ASR tests with Whisper and it performs very well, especially in English (which is my main use case). However, it sometimes fails to recognize uncommon terms such as entities or acronyms. For instance, when a speaker says:

...I hold access to SDRs...

The transcription looks like:

...I hold access to as the ours...

My question is: how can I tune the Whisper components (i.e. tokenizer, processor, model and so on) so that these domain-specific terms and acronyms are recognized more reliably? In other words, is there a way to tell Whisper that “SDR” is a perfectly likely token? If so, what would be the best approach?

Thanks for your help!

sanchit-gandhi · January 17, 2023, 1:45pm

Hey @sebasarango1180,

If you have a corpus of paired audio-text data with examples of such terms/entities/acronyms, you could experiment with fine-tuning the Whisper model on this dataset and seeing whether this improves downstream ASR performance on this distribution of data. To do so, you can follow the blog post at Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.

Since we’re fine-tuning for English-only (rather than multilingual), we need to make two modifications to the recipe outlined in the blog post:

Use an English-only checkpoint (e.g. small.en instead of small)
Omit the language and task args when we instantiate the processor:

from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")  # previously we set language="hi" and task="transcribe" -> we omit these args for English ASR

I would provisionally try fine-tuning just the model weights, leaving the feature extractor and tokenizer as they come from the pre-trained OpenAI checkpoint. There shouldn’t be a need to change the feature extractor in any circumstance - this component simply converts the raw audio waveform to a log-Mel spectrogram (see section Load WhisperFeatureExtractor).

The tokenizer has an extensive byte-pair vocabulary of ~50k sub-word tokens that can be used to form any word in the English language. I would first try leveraging the pre-trained tokenizer from OpenAI without any modifications. Note that this tokenizer won’t have any specific terms/entities/acronyms in its vocabulary, but will be able to form them from sub-word tokens (e.g. individual characters). Whilst this might be sub-optimal in terms of predicting the expected acronyms (SDR is composed of three sub-word tokens of S, D and R), it does mean that we can leverage all of the pre-trained weights from the OpenAI model directly. As soon as we change the tokenizer, such as by adding extra vocabulary items, we change the dimensionality of our final classification layer, and thus randomly initialise some proportion of the weights.

Another reason I believe using sub-word tokens to predict acronyms should work is that sub-word tokens more closely reflect the phonetic sounds of the audio. For example, when we say “SDR”, we don’t pronounce it as a single word, but rather say each of the letters individually (“ESS DEE ARR”). Thus, our model should be able to predict the individual tokens for each letter (S D R) when conditioned on the acoustic information.
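To see how the pre-trained tokenizer actually splits a given term, a quick check along these lines may help. It is only a sketch: `spaced_variants` and `inspect_split` are made-up helper names, and the first call to `inspect_split` downloads the tokenizer files for the checkpoint. Both the bare term and the term with a leading space are checked, because a byte-level BPE vocabulary treats the two differently.

```python
def spaced_variants(term):
    """Return the term as written and with a leading space.

    Byte-level BPE folds a preceding space into the token, so "SDR" at the
    start of a text and " SDR" mid-sentence can be split differently.
    """
    return [term, " " + term]


def inspect_split(terms, ckpt="openai/whisper-small.en"):
    """Map every variant of every term to its list of sub-word tokens."""
    # imported lazily so the helper above works without transformers installed
    from transformers import WhisperTokenizer

    tokenizer = WhisperTokenizer.from_pretrained(ckpt)  # downloads vocab files
    return {
        variant: tokenizer.tokenize(variant)
        for term in terms
        for variant in spaced_variants(term)
    }


# Example call (needs transformers and network access):
# for text, pieces in inspect_split(["SDR", "SDRs"]).items():
#     print(repr(text), "->", pieces)
```

If a term comes back as a handful of short pieces, the model can already spell it, and fine-tuning the weights alone is the cheaper experiment to run first.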

If this fails, we can experiment with adding the vocabulary items to the tokenizer and resizing the embedding layer. Note that this approach will only work if we have a corpus of data to train on: since we randomly initialise the new embedding weights, we’ll need to train the model in order for it to generate sensible predictions.

from transformers import WhisperTokenizer, WhisperForConditionalGeneration

# load pre-trained tokenizer and model
ckpt = "openai/whisper-small.en"
tokenizer = WhisperTokenizer.from_pretrained(ckpt)
model = WhisperForConditionalGeneration.from_pretrained(ckpt)

# define new tokens to add to vocab
new_tokens = ["SDR", ...]

# keep only the tokens that are not already in the vocabulary (order preserved)
vocab = tokenizer.get_vocab()
new_tokens = [token for token in new_tokens if token not in vocab]

# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))

# add new random embeddings for the appended tokens
model.resize_token_embeddings(len(tokenizer))

Supposing you don’t have paired audio-text data for fine-tuning, we could explore using an initial_prompt to boost the log-probs for certain vocab items, as is done in the ‘official’ OpenAI implementation. See prompt vs prefix in DecodingOptions · Discussion #117 · openai/whisper · GitHub and whisper/transcribe.py at 0f39c89d9212e4d0c64b915cf7ba3c1f0b59fecc · openai/whisper · GitHub for info.
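As a rough illustration of the initial_prompt route, the sketch below uses the openai-whisper package. The helper names (`build_initial_prompt`, `transcribe_with_terms`), the "Glossary:" lead-in and the default checkpoint name are assumptions made for the example, not something prescribed by Whisper. The prompt is ordinary text that the decoder is conditioned on, so it biases the output rather than guaranteeing it.

```python
def build_initial_prompt(terms, lead="Glossary:"):
    """Join domain terms into one short prompt, dropping blanks and duplicates."""
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    return f"{lead} {', '.join(unique)}."


def transcribe_with_terms(audio_path, terms, model_name="small.en"):
    """Transcribe audio_path with the terms passed as Whisper's initial_prompt."""
    import whisper  # openai-whisper, imported lazily

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, initial_prompt=build_initial_prompt(terms))
    return result["text"]


# Example call (needs openai-whisper, ffmpeg and a real audio file):
# print(transcribe_with_terms("meeting.wav", ["SDRs", "IMF"]))
```

Keeping the prompt short and close to how the terms would appear in a real transcript is probably the safer choice, since the model tends to imitate the style of whatever it is given.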

KubilayCan · August 23, 2023, 3:42pm

Instead of adding new tokens to the vocabulary and resizing the embedding layer with random initialization, would it make sense to replace the N least-used tokens with the new tokens?

artyomboyko · December 20, 2023, 11:49am

@sanchit-gandhi Hello Sanchit. Could you clarify one more thing? How do we tell that the dataset contains new tokens that need to be added to the tokenizer? As far as I understand, if the dataset contains new tokens and we fine-tune the model without adding them, there may be problems when computing the metric, because it will be calculated with the unknown tokens taken into account. Consequently, the model will not be able to fully learn the knowledge in the dataset.

andregn · January 12, 2024, 8:52pm

Here are 2 other approaches.
No training required, so I highly recommend trying this before fine-tuning models or changing their architecture.

1. Initial Prompt

You can simply use the parameter initial_prompt to create a bias towards your vocabulary.
In your example, you could write: "Let's talk about International Monetary Fund and SDRs."
This will encourage the model to repeat the term SDRs and other terms related to finances.

or…

2. Suppress Tokens

Sometimes Whisper keeps using the wrong word. If that’s the case, you can suppress that token.

For example, let’s pretend there’s a Latin name “Esthear” and Whisper transcribes the audio as “I hold access to Esthear’s…”
Pretend this name is represented by tokens:
("Esthe", "ar") → (98765, 12345)

If you suppress the token “Esthe”, Whisper will have to come up with an alternative transcription of your audio, and will hopefully guess “SDR” correctly.

But be careful not to suppress common tokens. If you suppress both tokens “Esthe” + “ar”, it might affect other words, like “mo-net-ar-y” or “tow-ar-ds”.

Code Example

initial_prompt = "Let's talk about International Monetary Fund and SDRs."
# 98765 is the made-up id for "Esthe" from the example above
model.transcribe(audio_file, initial_prompt=initial_prompt, suppress_tokens=[98765])
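The ids above are placeholders, so in practice you would first look up the real ids for the unwanted word. The sketch below assumes the openai-whisper package; `build_suppress_list` and `token_ids_for` are hypothetical helper names. One detail worth knowing: as far as I can tell from the openai-whisper decoding code, the default value of suppress_tokens is "-1", which stands for a built-in set of non-speech symbols, and passing a plain list without -1 replaces that default. The helper therefore keeps -1 in the list unless told otherwise.

```python
def build_suppress_list(extra_ids, keep_defaults=True):
    """Combine custom token ids with Whisper's default suppression marker.

    -1 is the marker openai-whisper expands into its non-speech symbol set.
    """
    ids = sorted(set(int(i) for i in extra_ids))
    return ([-1] if keep_defaults else []) + ids


def token_ids_for(text, multilingual=False):
    """Return the token ids that openai-whisper's tokenizer assigns to text."""
    from whisper.tokenizer import get_tokenizer  # lazy import

    return get_tokenizer(multilingual=multilingual).encode(text)


# Example call (needs openai-whisper and a loaded model):
# bad_ids = token_ids_for(" Esthear")[:1]   # suppress only the first piece
# model.transcribe(audio_file, suppress_tokens=build_suppress_list(bad_ids))
```

Note that a word usually has a different token id with and without a leading space, so it can be worth checking both forms before deciding what to suppress.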

jaxmetaverse · March 3, 2024, 2:14pm

If the hotword is not in the first sentence, the initial_prompt is valid.

shiviagrawal05 · January 7, 2025, 3:13pm

Hi,

I followed this thread. I have a use case with domain-specific keywords that Whisper is unable to recognise. We first tried fine-tuning large-v2 without updating the tokeniser. Performance is much better and the transcription quality has improved.

Still, we saw the keywords being misinterpreted most of the time. So we added a corpus of words to the tokeniser, just as @sanchit-gandhi suggested, and fine-tuned again. The results are now pretty bad. The embeddings for the new tokens were initialised with the mean of the pre-existing token embeddings.

With the updated tokeniser, the transcript is truncated or skips text wherever a new token should appear. I can’t work out what is causing this. How many audio-text pairs would be needed to learn the new relation?

rabans · March 25, 2025, 6:21pm

I have not tried suppressing, but in my experience prompting affects the quality of the rest of the output as well. Here is an example:

A single prompted word interferes with other similar-sounding words. For example, after I added “Artem” to the prompt, “startup” started being transcribed incorrectly.

Any other workaround that you found to use custom vocabularies?
