Fine-Tuning Whisper on My Own Dataset with a Customized Tokenizer

Background

I have followed this amazing blog, Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers, on fine-tuning Whisper on my dataset, and the performance is decent! However, my dataset is in Bahasa Indonesia and my use case is a helpline phone chatbot where the users would only speak Bahasa, and I have seen some wrong predictions where the transcribed words are not Bahasa words. Whisper is trained on a multilingual dataset and has translation capabilities, which I do not really need. This got me thinking about creating a new BPE tokenizer that is pre-trained on Bahasa words only.

Problem

After training with my new customized tokenizer, the new Whisper model predicts gibberish and I am not sure how to debug it. Any help or directions would be greatly appreciated.

Code

Training Tokenizer

# training tokenizer
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=['indo_corpus.txt'],
                min_frequency=2)
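# (assumed step, not shown in the original post) save the trained tokenizer
# so that the files loaded below actually exist on disk:
tokenizer.save_model('.', 'indo')   # writes indo-vocab.json and indo-merges.txt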

# loading tokenizer
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
old_tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="indonesian", task="transcribe")
tokenizer = WhisperTokenizer(vocab_file='indo-vocab.json',
                             merges_file='indo-merges.txt',
                             unk_token='',
                             bos_token='<|endoftext|>',
                             pad_token='<|endoftext|>',
                             model_max_length=1024,
                             language='indonesian', task='transcribe')
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="id", task="transcribe")

tokenizer.add_special_tokens({
    'additional_special_tokens': old_tokenizer.special_tokens_map['additional_special_tokens']
})

Inference

import numpy as np
from transformers import pipeline

pipe = pipeline(
    task='automatic-speech-recognition',
    model='checkpoint-4000/',
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    device=0)

def transcribe(audio):
    max_duration_ms = 30000  # chunk length in ms
    transcription = ''
    for i in range(len(audio) // max_duration_ms + 1):
        if i == len(audio) // max_duration_ms:
            sample_audio = audio[i * max_duration_ms:]
        else:
            sample_audio = audio[i * max_duration_ms: (i + 1) * max_duration_ms]
        # convert 16-bit PCM samples to float32 in [-1, 1]
        sound_array = np.array(sample_audio.get_array_of_samples(), dtype=np.float32) / 2**15
        text = pipe(sound_array)["text"]
        transcription += text
    return transcription
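For reference, this is roughly how the function gets called, assuming the audio is loaded as a pydub AudioSegment (the file name is just a placeholder) and resampled to 16 kHz mono as Whisper expects:

from pydub import AudioSegment

# hypothetical input file; Whisper's feature extractor expects 16 kHz mono audio
audio = AudioSegment.from_file("helpline_call.wav").set_frame_rate(16000).set_channels(1)
print(transcribe(audio))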

Hey @notha99y! Cool to see that you’ve been using the blog post!

The Whisper model is pre-trained on 96 languages, Indonesian being one of them. This means that the pre-trained tokenizer already has all of the Indonesian words you need! I would recommend that you leverage this pre-trained tokenizer directly rather than training a new one. Why? Because then we can also leverage all of the pre-trained Whisper weights directly! If we build a new tokenizer, we have to randomly initialise some of the Whisper weights to work with our new tokenizer, meaning we lose some of the knowledge from pre-training. If we use the pre-trained one, we can use all of the weights (and so all of the knowledge!). The Whisper model quickly learns which bit of the pre-trained tokenizer to use when fine-tuning, in your case the Indonesian part.

So I’d recommend you keep the pre-trained tokenizer, and simply set the correct language when you instantiate the processor in this line: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
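Concretely, something along these lines should be all you need (a minimal sketch following the blog post's setup, not a snippet copied from it):

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="indonesian", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# at inference time, also force Indonesian transcription prompts
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="indonesian", task="transcribe")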

Yes there’s a bit of redundancy in the tokenizer, but our overall performance should be better!


Hi @sanchit-gandhi, thank you for writing back! And once again, thank you for the amazing blog post.

I did try to instantiate the processor with the correct language as stated in your blog, but there were occasions where Whisper transcribed my audio into Japanese. Haha.

Indeed, changing the tokenizer was the wrong move. My simple fix was to keep the pre-trained tokenizer and scale up the training dataset 100 times. No more wrong Japanese transcriptions! :tada:

Perfect! All the best with your training runs :hugs:

I have my own dataset that contains medical terms and their audio, and I want to create my own model for speech-to-text conversion. How can I approach this problem? Or is there a specific dataset already on the Hugging Face Hub?

Hey @Ajayagnes! Welcome to the HF community and thanks for posting this awesome question :hugs: It should be possible to fine-tune the Whisper model on your own dataset for medical audio/text. You can follow the steps outlined in this blog post: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers. You’ll just need to make sure that your audio dataset is in the HF datasets format (Create an audio dataset).
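As a rough sketch of what that can look like: if your recordings sit in a local folder together with a metadata.csv that maps file names to transcriptions, the audiofolder loader gets you most of the way there (the path below is a placeholder):

from datasets import load_dataset, Audio

# data_dir should contain the audio files plus a metadata.csv with a "file_name" column
dataset = load_dataset("audiofolder", data_dir="path/to/medical_audio")

# Whisper's feature extractor expects 16 kHz audio
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))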

If you’re interested in English-only speech-to-text, you simply need to load one of the English-only checkpoints:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

And omit the language/task arguments when you instantiate the processor:

processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")

Otherwise, for multilingual speech recognition, you can follow the blog post as is, substituting the Common Voice 11 dataset for your own dataset.

I’m not aware of any medical ASR datasets that are on the HF Hub! Maybe @polinaeterna knows of one?

Thank you for the reply, sir.
I'll do it with the above-mentioned tips.

@sanchit-gandhi can you please explain why we need to put the audio dataset on the HF Hub? Does Hugging Face preprocess the audio and text into a particular format for fine-tuning, or can we keep the audio and text local and fine-tune?

Is there any standard format, such as pitch and speed for the audio, and lower-cased, normalized text for the transcript?

I have my own audio datasets and would like to know whether I should push them to Hugging Face or not. Should I use text-to-speech synthesis and audio augmentation to fine-tune Whisper?

@sanchit-gandhi I am using my custom dataset. Now, while running trainer.train(), I am getting the following error:
RuntimeError: The size of tensor a (517) must match the size of tensor b (448) at non-singleton dimension 1

Please note that I am using the Whisper medium model, as seen below:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-hi",  # change to a repo name of your choice
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

Below is the error stacktrace:

use_cache = True is incompatible with gradient checkpointing. Setting use_cache = False
(Training reached step 1001/4000, epoch 1.76; the error was raised during the evaluation at step 1000, after 151/228 eval batches.)

RuntimeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

15 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1578 # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
1579 hf_hub_utils.disable_progress_bars()
→ 1580 return inner_training_loop(
1581 args=args,
1582 resume_from_checkpoint=resume_from_checkpoint,

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1980 self.control = self.callback_handler.on_step_end(args, self.state, self.control)
1981
→ 1982 self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
1983 else:
1984 self.control = self.callback_handler.on_substep_end(args, self.state, self.control)

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
2324 metrics.update(dataset_metrics)
2325 else:
→ 2326 metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
2327 self._report_to_hp_search(trial, self.state.global_step, metrics)
2328

/usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix, **gen_kwargs)
163 self._gen_kwargs = gen_kwargs
164
→ 165 return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
166
167 def predict(

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
3062
3063 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
→ 3064 output = eval_loop(
3065 eval_dataloader,
3066 description="Evaluation",

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
3251
3252 # Prediction step
→ 3253 loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
3254 main_input_name = getattr(self.model, "main_input_name", "input_ids")
3255 inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None

/usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py in prediction_step(self, model, inputs, prediction_loss_only, ignore_keys, **gen_kwargs)
310 if has_labels:
311 with self.compute_loss_context_manager():
→ 312 outputs = model(**inputs)
313 if self.label_smoother is not None:
314 loss = self.label_smoother(outputs, inputs["labels"]).mean().detach()

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
→ 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py in forward(*args, **kwargs)
634
635 def forward(*args, **kwargs):
→ 636 return model_forward(*args, **kwargs)
637
638 # To act like a decorator so that it can be popped when doing extract_model_from_parallel

/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py in call(self, *args, **kwargs)
622
623 def call(self, *args, **kwargs):
→ 624 return convert_to_fp32(self.model_forward(*args, **kwargs))
625
626 def getstate(self):

/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py in decorate_autocast(*args, **kwargs)
12 def decorate_autocast(*args, **kwargs):
13 with autocast_instance:
—> 14 return func(*args, **kwargs)
15 decorate_autocast.__script_unsupported = '@autocast() decorator is not supported in script mode'  # type: ignore[attr-defined]
16 return decorate_autocast

/usr/local/lib/python3.10/dist-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1484 )
1485
→ 1486 outputs = self.model(
1487 input_features,
1488 attention_mask=attention_mask,

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
→ 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.10/dist-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1360
1361 # decoder outputs consists of (dec_features, past_key_value, dec_hidden, dec_attn)
→ 1362 decoder_outputs = self.decoder(
1363 input_ids=decoder_input_ids,
1364 attention_mask=decoder_attention_mask,

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
→ 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.10/dist-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_ids, attention_mask, encoder_hidden_states, head_mask, cross_attn_head_mask, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1124 positions = self.embed_positions(inputs_embeds, past_key_values_length=past_key_values_length)
1125
→ 1126 hidden_states = inputs_embeds + positions
1127 hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
1128

RuntimeError: The size of tensor a (517) must match the size of tensor b (448) at non-singleton dimension 1

Looks to be the same issue as: [Open-to-the-community] Whisper fine-tuning event - #21 by sanchit-gandhi
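For reference, the usual fix for this error is to drop (or re-chunk) any training example whose tokenised transcription is longer than Whisper's 448 decoder positions, e.g. with a filter along these lines (dataset and column names are assumed to match the blog post's setup):

max_label_length = model.config.max_target_positions  # 448 for Whisper

def is_label_length_ok(labels):
    return len(labels) < max_label_length

common_voice = common_voice.filter(is_label_length_ok, input_columns=["labels"])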

Thank you, this was helpful.
@sanchit-gandhi: One more question - how do I save the model as a pickle file, "pytorch_model.bin"?
When I push the model to the repo using trainer.push_to_hub(**kwargs) [e.g. CKSINGH/whisper-small-hi-firefox], I don't see the pickle file pushed alongside.
How can I save the pickle file? I would need it to integrate the model with an LM.

Hey @CKSINGH - the model is saved in safetensors format by default, as we can see here: model.safetensors · CKSINGH/whisper-small-hi-firefox at main

PyTorch pickle files are inherently unsafe, as explained here: 🐶Safetensors audited as really safe and becoming the default

safetensors bypasses these security flaws. The safetensors weights can be loaded directly into the model using from_pretrained, i.e.:

# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("CKSINGH/whisper-small-hi-firefox")
model = AutoModelForSpeechSeq2Seq.from_pretrained("CKSINGH/whisper-small-hi-firefox")

If you really need to save it without safetensors, you can pass safe_serialization=False when you save:

# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("CKSINGH/whisper-small-hi-firefox")

# load the model from the model.safetensors weights
model = AutoModelForSpeechSeq2Seq.from_pretrained("CKSINGH/whisper-small-hi-firefox")

# push again, this time saving the weights as pytorch_model.bin
model.push_to_hub("CKSINGH/whisper-small-hi-firefox", safe_serialization=False)
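The same flag also works for a plain local save, if you just need the file on disk (the output directory name below is arbitrary):

# writes pytorch_model.bin instead of model.safetensors
model.save_pretrained("./whisper-small-hi-firefox-pt", safe_serialization=False)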

Can you please help me fix the following error? I am trying to run this on my own data. Refer to the code below.

from datasets import Dataset, DatasetDict, Audio

# df holds the file names and transcriptions; prefix_path points to the audio folder
df_train = df[:400]
df_test = df[400:]
df_train['Audio_id'] = df_train['Audio_id'].map(lambda x: prefix_path + x)
df_test['Audio_id'] = df_test['Audio_id'].map(lambda x: prefix_path + x)

Data = DatasetDict()
Data['train'] = Dataset.from_dict({'audio': df_train['Audio_id'], 'transcripts': df_train['input_text']}).cast_column("audio", Audio())
Data['test'] = Dataset.from_dict({'audio': df_test['Audio_id'], 'transcripts': df_test['input_text']}).cast_column("audio", Audio())

These are the training arguments

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=25,
    num_train_epochs=5,
    max_steps=50,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=400,
    eval_steps=400,
    logging_steps=5,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)

Now I am getting this error:

Please help.

Hi @sanchit-gandhi,

I fine-tuned the model on my data and it seems to be working well. But I want to use it with whisper live, for which I need to convert it using ct2-transformers-converter, which expects a tokenizer.json file that I do not see among the fine-tuned files. Can I use the original file from the base model? Or did I miss something that should have generated the file during fine-tuning?

Kindly advise.

I ran into the exact same issue. I was simply inputting one large recording into my training data. Whisper requires 30-second (or shorter; the remaining seconds get padded by the collator) .wav audio with a 16 kHz sampling rate (32-bit or 16-bit float). I met the criteria, but I had one outlier, a 70-second clip, which I did not notice in my preprocessing. I manually removed it and the issue was fixed.

If you are using a collator such as the one from Fine-Tune Whisper For Multilingual ASR with :hugs: Transformers, add the following debugging print statements just before you return batch in that collator code:

print("Batch input features shape:", batch["input_features"].shape)
print("Batch labels shape:", labels.shape)

It will give you a clearer view of the input-feature and label shapes you are feeding in per step, and you will definitely see an outlier if you have an over-long audio file.
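If you would rather catch such clips automatically than by eye, you can also filter over-long audio before training, e.g. with something like this sketch (assuming a 🤗 datasets Dataset with an audio column; the variable name is a placeholder):

MAX_DURATION_S = 30.0

def is_audio_short_enough(audio):
    # duration in seconds = number of samples / sampling rate
    return len(audio["array"]) / audio["sampling_rate"] <= MAX_DURATION_S

dataset = dataset.filter(is_audio_short_enough, input_columns=["audio"])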

@sanchit-gandhi hello, I tried to train a BPE tokenizer for the Pashto language, but it is not working; it produces garbage text. However, I did fine-tune an ASR model for Pashto and its WER is 46. Can you suggest any other way to solve this tokenization issue? Thank you. :hugs:

Instead of creating a new tokenizer, merging the current tokenizer with one trained on your data will likely bring some improvement.
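I have not benchmarked this, but one rough way to do such a merge is to add the custom tokenizer's extra tokens to the pre-trained Whisper tokenizer and resize the model's embeddings to match (a sketch only; new_tokenizer stands for the BPE tokenizer trained on your own corpus, and byte-level artefacts such as the "Ġ" space prefix may need cleaning up first):

from transformers import WhisperTokenizer, WhisperForConditionalGeneration

old_tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

# tokens present in the custom tokenizer but missing from the pre-trained one
new_tokens = set(new_tokenizer.get_vocab().keys()) - set(old_tokenizer.get_vocab().keys())
old_tokenizer.add_tokens(list(new_tokens))

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# grow the embedding matrix so the newly added tokens get (randomly initialised) rows
model.resize_token_embeddings(len(old_tokenizer))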