Fine-Tuning Whisper on my own Dataset with a Customized Tokenizer


I have followed this amazing blog Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers on fine-tuning Whisper on my dataset, and the performance is decent! However, my dataset is in Bahasa Indonesia and my use case is a helpline phone chatbot where users would only speak Bahasa, and I have seen some wrong predictions where the transcribed words are not Bahasa words. Since Whisper is trained on a multilingual dataset and has translation capabilities that I do not really need, this got me thinking to train a new BPE tokenizer on Bahasa words only.


After training with my new customized tokenizer, the new Whisper model predicts gibberish and I am not sure how to debug it. Any help or directions would be greatly appreciated.


Training Tokenizer

# training tokenizer (the original snippet never called train; the corpus file here is assumed)
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=['bahasa_corpus.txt'], special_tokens=['<|endoftext|>'])
tokenizer.save_model('.', 'indo')  # writes indo-vocab.json and indo-merges.txt

# loading tokenizer
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
old_tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="indonesian", task="transcribe")
tokenizer = WhisperTokenizer(vocab_file='indo-vocab.json',
                             merges_file='indo-merges.txt',  # WhisperTokenizer also needs the BPE merges
                             bos_token='<|endoftext|>',
                             pad_token='<|endoftext|>',
                             model_max_length=1024,
                             language='indonesian', task='transcribe')
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="id", task="transcribe")

# carry over Whisper's task/language special tokens from the original tokenizer
tokenizer.add_special_tokens({
    'additional_special_tokens': old_tokenizer.special_tokens_map['additional_special_tokens']
})


from transformers import pipeline
pipe = pipeline(
    task='automatic-speech-recognition',
    model=model,  # the fine-tuned model (its definition is omitted in the original post)
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
)

import numpy as np

def transcribe(audio):
    """Transcribe a pydub AudioSegment in 30-second chunks."""
    max_duration_ms = 30000  # Whisper processes at most 30 s of audio at a time
    transcription = ''
    for i in range(len(audio) // max_duration_ms + 1):
        if i == len(audio) // max_duration_ms:
            sample_audio = audio[i * max_duration_ms:]  # final partial chunk
        else:  # the original code was missing this else, so every chunk was overwritten
            sample_audio = audio[i * max_duration_ms:(i + 1) * max_duration_ms]
        # int16 samples -> float32 in [-1, 1)
        sound_array = np.array(sample_audio.get_array_of_samples(), dtype=np.float32) / 2**15
        transcription += pipe(sound_array)["text"]
    return transcription
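The chunking arithmetic in `transcribe` can be checked on a plain NumPy array, without pydub or a model; the 62-second signal and 16 kHz sample rate below are illustrative:

```python
import numpy as np

sr = 16000                                   # samples per second (illustrative)
chunk_len = 30 * sr                          # 30-second chunks, as in transcribe()
signal = np.zeros(62 * sr, dtype=np.int16)   # a fake 62-second recording

chunks = []
for i in range(len(signal) // chunk_len + 1):
    if i == len(signal) // chunk_len:
        piece = signal[i * chunk_len:]                    # final partial chunk
    else:
        piece = signal[i * chunk_len:(i + 1) * chunk_len]
    # int16 -> float32 in [-1, 1), same normalisation as transcribe()
    chunks.append(piece.astype(np.float32) / 2**15)

print([len(c) / sr for c in chunks])  # → [30.0, 30.0, 2.0]
```

The final iteration picks up the 2-second remainder that the original (missing-`else`) version silently dropped for all but the last chunk.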

Hey @notha99y! Cool to see that you’ve been using the blog post!

The Whisper model is pre-trained on 96 languages, Indonesian being one of them. This means that the pre-trained tokenizer already has all of the Indonesian words you need! I would recommend that you leverage this pre-trained tokenizer directly rather than training a new one. Why? Because then we can also leverage all of the pre-trained Whisper weights directly! If we build a new tokenizer, we have to randomly initialise some of the Whisper weights to work with our new tokenizer, meaning we lose some of the knowledge from pre-training. If we use the pre-trained one, we can use all of the weights (and so all of the knowledge!). The Whisper model quickly learns which bit of the pre-trained tokenizer to use when fine-tuning, in your case the Indonesian part.

So I’d recommend you keep the pre-trained tokenizer, and simply set the correct language when you instantiate the processor in this line: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

Yes there’s a bit of redundancy in the tokenizer, but our overall performance should be better!
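Concretely, keeping the pre-trained tokenizer and pinning the language is a one-liner; the tokenizer then prepends the language and task special tokens for you (the example sentence is arbitrary):

```python
from transformers import WhisperProcessor

# keep the pre-trained tokenizer; just fix the language and task
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="indonesian", task="transcribe"
)

# the language/task special tokens are prepended automatically
ids = processor.tokenizer("halo dunia").input_ids
print(processor.tokenizer.decode(ids))
# starts with <|startoftranscript|><|id|><|transcribe|> before the text
```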


Hi @sanchit-gandhi thank you for writing back! And once again thank you for the amazing blog post.

I did try to instantiate the processor with the correct language as stated in your blog, but there were occasions where Whisper transcribed my audio into Japanese. Haha.

Indeed, changing the tokenizer was a wrong move. My simple fix was to keep the pre-trained tokenizer and scale up the training dataset 100 times. No more wrong Japanese transcriptions! :tada:

Perfect! All the best with your training runs :hugs:

If I have my own dataset containing medical terms and their audio, and I want to create my own model for speech-to-text conversion, how should I approach this problem? Or is there a specific dataset already on the Hugging Face Hub?

Hey @Ajayagnes! Welcome to the HF community and thanks for posting this awesome question :hugs: It should be possible to fine-tune the Whisper model on your own dataset for medical audio/text. You can follow the steps outlined in this blog post: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers. You’ll just need to make sure that your audio dataset is in the HF datasets format (Create an audio dataset).

If you’re interested in English-only speech-to-text, you simply need to load one of the English-only checkpoints:

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

And omit the language/task arguments when you instantiate the processor:

processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")

Otherwise, for multilingual speech recognition, you can follow the blog post as is, substituting the Common Voice 11 dataset for your own dataset.

I’m not aware of any medical ASR datasets that are on the HF Hub! Maybe @polinaeterna knows of one?

Thank you for the reply, sir.
I'll try it with the above-mentioned tips.