If you have a corpus of paired audio-text data with examples of such terms/entities/acronyms, you could experiment with fine-tuning the Whisper model on this dataset and seeing whether this improves downstream ASR performance on this distribution of data. To do so, you can follow the blog post “Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers”.
Since we’re fine-tuning for English-only (rather than multilingual), we need to make two modifications to the recipe outlined in the blog post:
- Use an English-only checkpoint (e.g. small.en instead of small)
- Omit the language and task args when we instantiate the processor:
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en") # previously we set language="hi" and task="transcribe" -> we omit these args for English ASR
I would provisionally try fine-tuning just the model weights, leaving the feature extractor and tokenizer as they come from the pre-trained OpenAI checkpoint. There shouldn’t be a need to change the feature extractor in any circumstance - this component simply converts the raw audio waveform to a log-Mel spectrogram (see section Load WhisperFeatureExtractor).
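To make the point about the feature extractor concrete, here is a minimal sketch of what it does to a waveform. It uses a default-constructed WhisperFeatureExtractor (which I'm assuming matches the settings of the pre-trained checkpoint; in practice you would load it with from_pretrained) and a synthetic sine wave standing in for real audio:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Default-constructed extractor (assumption: same settings as the checkpoint).
# No download is needed: there are no learned weights or vocabulary here.
feature_extractor = WhisperFeatureExtractor()

# One second of a synthetic 440 Hz tone at 16 kHz, standing in for real speech
sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
waveform = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# Pad/truncate to 30 s and convert to a log-Mel spectrogram
inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="np")
log_mel = inputs.input_features

print(log_mel.shape)  # (batch, mel bins, frames)
```

Since this is a fixed signal-processing step that never sees the text, new terms/entities/acronyms in the transcriptions have no bearing on it.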
The tokenizer has an extensive byte-pair vocabulary of ~50k sub-word tokens that can be used to form any word in the English language. I would first try leveraging the pre-trained tokenizer from OpenAI without any modifications. Note that this tokenizer won’t have any specific terms/entities/acronyms in its vocabulary, but will be able to form them from sub-word tokens (e.g. individual characters). Whilst this might be sub-optimal in terms of predicting the expected acronyms (SDR is composed of three sub-word tokens of S, D and R), it does mean that we can leverage all of the pre-trained weights from the OpenAI model directly. As soon as we change the tokenizer, such as by adding extra vocabulary items, we change the dimensionality of our final classification layer, and thus randomly initialise some proportion of the weights.
One other reason I believe using sub-word tokens to predict acronyms should work is that sub-word tokens more closely reflect the phonetic sounds of the audio. For example, when we say “SDR”, we don’t pronounce this as a single word, but rather say each of the letters individually (“ESS DEE ARR”). Thus, our model should be able to predict the individual tokens for each letter (S D R) when conditioned on the acoustic information.
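It's worth checking how the pre-trained tokenizer actually splits your terms before deciding whether to change it. The helper names below (split_into_subwords, inspect_terms) are my own and purely illustrative; the sketch only relies on the tokenizer's encode and decode methods:

```python
from typing import Dict, List, Sequence


def split_into_subwords(tokenizer, term: str) -> List[str]:
    """Return the sub-word pieces the tokenizer uses to spell out `term`."""
    # Encode without special tokens so we only see the pieces of the term itself
    token_ids = tokenizer.encode(term, add_special_tokens=False)
    # Decode each id on its own to recover the text of each piece
    return [tokenizer.decode([token_id]) for token_id in token_ids]


def inspect_terms(
    terms: Sequence[str], ckpt: str = "openai/whisper-small.en"
) -> Dict[str, List[str]]:
    """Load the pre-trained tokenizer (downloads it) and split each term."""
    from transformers import WhisperTokenizer  # imported lazily: needs network access

    tokenizer = WhisperTokenizer.from_pretrained(ckpt)
    return {term: split_into_subwords(tokenizer, term) for term in terms}
```

Calling inspect_terms(["SDR", " SDR"]) shows the split with and without a leading space. Byte-pair tokenizers usually treat the two differently, and mid-sentence words are typically encoded with the leading space, so the exact pieces may not be one token per letter.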
If this fails, we can experiment with adding the vocabulary items to the tokenizer and resizing the embedding layer. Note that this approach will only work if we have a corpus of data to train on: since we randomly initialise the new embedding weights, we’ll need to train the model in order for it to generate sensible predictions.
from transformers import WhisperTokenizer, WhisperForConditionalGeneration
# load pre-trained tokenizer and model
ckpt = "openai/whisper-small.en"
tokenizer = WhisperTokenizer.from_pretrained(ckpt)
model = WhisperForConditionalGeneration.from_pretrained(ckpt)
# define new tokens to add to vocab
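new_tokens = ["SDR"]  # extend this list with your own terms/entities/acronyms
# keep only the tokens that are not already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.get_vocab().keys())
# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))
# add new random embeddings for the appended tokens
model.resize_token_embeddings(len(tokenizer))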
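To see what resizing the embedding layer does to the weights, here is a minimal sketch on a tiny, randomly initialised Whisper model. The config sizes are made up so that it runs without downloading a checkpoint; the same check should carry over to the real model:

```python
import torch
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Tiny Whisper model with hypothetical sizes (illustration only, no download)
config = WhisperConfig(
    vocab_size=100,
    d_model=16,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=32,
    decoder_ffn_dim=32,
    num_mel_bins=8,
    max_source_positions=16,
    max_target_positions=16,
    pad_token_id=0,
    bos_token_id=1,
    eos_token_id=2,
    decoder_start_token_id=1,
)
model = WhisperForConditionalGeneration(config)

old_vocab_size = config.vocab_size
num_new_tokens = 5  # stands in for the number of tokens added to the tokenizer

# Snapshot the token embeddings before resizing
old_embeddings = model.get_input_embeddings().weight.detach().clone()

# Grow the embedding matrix (and the output projection tied to it)
model.resize_token_embeddings(old_vocab_size + num_new_tokens)

new_embeddings = model.get_input_embeddings().weight.detach()
output_layer = model.get_output_embeddings()

# Existing rows are copied over unchanged; only the appended rows are new
rows_preserved = torch.equal(new_embeddings[:old_vocab_size], old_embeddings)

print("embedding matrix:", tuple(new_embeddings.shape))
print("output projection:", tuple(output_layer.weight.shape))
print("old rows preserved:", rows_preserved)
```

The existing rows keep their pre-trained values, while the appended rows are freshly initialised (how exactly may depend on the library version) and have never been trained. That is why this route only makes sense if we have data to fine-tune on.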
Supposing you don’t have paired audio-text data for fine-tuning, we could explore using an initial_prompt to boost the log-probs for certain vocab items, as is done in the ‘official’ OpenAI implementation. See the discussion “prompt vs prefix in DecodingOptions” (Discussion #117 in openai/whisper on GitHub) and whisper/transcribe.py at commit 0f39c89d9212e4d0c64b915cf7ba3c1f0b59fecc in openai/whisper for info.
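As a rough sketch of the initial_prompt route with the openai-whisper package: the helper names and the “Glossary: ...” prompt format below are my own assumptions, not something prescribed by the OpenAI implementation, so treat the prompt wording as something to experiment with:

```python
from typing import Iterable


def build_initial_prompt(terms: Iterable[str]) -> str:
    """Join domain terms into a short text prompt (hypothetical format)."""
    # Strip whitespace, drop empty entries and de-duplicate while keeping order
    unique_terms = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    return "Glossary: " + ", ".join(unique_terms) + "."


def transcribe_with_terms(
    audio_path: str, terms: Iterable[str], model_name: str = "small.en"
) -> str:
    """Transcribe `audio_path`, conditioning the decoder on the given terms."""
    import whisper  # openai-whisper package, imported lazily (downloads the model)

    model = whisper.load_model(model_name)
    # The prompt text is fed to the decoder as preceding context
    result = model.transcribe(audio_path, initial_prompt=build_initial_prompt(terms))
    return result["text"]
```

For example, transcribe_with_terms("audio.wav", ["SDR"]) would transcribe a (hypothetical) local file with “SDR” in the prompt. Comparing the output against a run without the prompt is the easiest way to tell whether it helps on your data.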