Whisper model fine-tuning

There is an excellent blog post by @sanchit-gandhi on fine-tuning the Whisper model on a Hindi (Devanagari) dataset.
I need to fine-tune the Whisper model on a Hinglish dataset (a mix of English and Hindi). The supervised data contains Hinglish annotations in Roman script (not Devanagari).
Is there a way to fine-tune the Whisper model on this dataset? Can I replace the model's tokenizer with my own custom tokenizer to proceed with the fine-tuning process?
Please suggest a way to proceed with this task.

Hey @Ankit-Kumar-Saini!

To clarify, does your dataset contain Hindi characters and Roman ones (i.e. the letters a-z)? Just Hindi ones? Or just Roman ones (a-z)?

The likelihood is we won't need a new tokenizer - the Whisper tokenizer already covers the Hindi alphabet and the Roman alphabet, among others. It's just how we initialise the tokenizer that changes.
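For example, here is a rough sketch (assuming the transformers WhisperTokenizer API, with the small checkpoint and the sentence as placeholders) showing that the text tokens stay the same regardless of the language setting - only the special prefix tokens change:

```python
from transformers import WhisperTokenizer

# Same vocabulary either way - only the special language/task prefix tokens differ
tok_hi = WhisperTokenizer.from_pretrained("openai/whisper-small", language="hindi", task="transcribe")
tok_en = WhisperTokenizer.from_pretrained("openai/whisper-small", language="english", task="transcribe")

text = "aapka age kitna hai"  # placeholder Hinglish sentence
print(tok_hi.decode(tok_hi(text).input_ids, skip_special_tokens=False))
print(tok_en.decode(tok_en(text).input_ids, skip_special_tokens=False))
# Both reproduce the sentence; they differ only in the <|hi|> vs <|en|> language prefix token.
```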

FYI, this is the updated blog post: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

The dataset contains only Roman ones (i.e. the letters a-z).
For example: “aapka age kitna hai”.
How should I tokenize this text?

Okay! I probably wouldn't build a new tokeniser - I would first try using the existing Whisper one, as it contains the entire Roman alphabet and more words in word-piece form.

You need to change this part of the script:
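i.e. the line where the processor is loaded with the language and task set for Hindi, which looks roughly like:

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")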

I would first try omitting the language and task arguments:

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

If you train for long enough, the model should learn the correct output alphabet.
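As a quick sanity check, here is a rough sketch (swap in your own checkpoint and a sentence from your data) for confirming that the stock vocabulary already covers Roman-script Hinglish:

```python
from transformers import WhisperProcessor

# Sketch: round-trip a Hinglish transcript through the stock tokenizer
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

text = "aapka age kitna hai"
label_ids = processor.tokenizer(text).input_ids

print(processor.tokenizer.convert_ids_to_tokens(label_ids))             # sub-word pieces
print(processor.tokenizer.decode(label_ids, skip_special_tokens=True))  # should match `text`
```

If the decoded string comes back unchanged, the existing vocabulary handles your transcripts and you can build the labels for fine-tuning the same way as in the blog post, with no custom tokenizer.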

Hi @sanchit-gandhi, thanks for your blog post - it was excellent. However, I am facing an issue fine-tuning the Whisper model on my data and need your support. I have audio files and a corresponding metadata CSV file with two columns: wav_filename, which contains the paths to the audio files, and transcript, which contains the corresponding text. I am getting the following error while running the code: FileNotFoundError: [Errno 2] No such file or directory: 'wav_filename'. Any help would be greatly appreciated. Thanks.

Hey @Deveshp! Welcome to the HF forum and thanks for asking a great question! :hugs: Would you mind opening a new forum post for your question so we can shift the discussion there?

The reason is that it makes searching for issues much easier if we keep each forum post related to one question. That way, if someone has the same issue later down the line, they can sift through the previous posts and hopefully find our discussion! Thanks!

Hey Ankit, I know I am replying to a two-year-old post, but this is exactly what I needed and Google gave me this link. Have you found a solution? What was the WER, and how long did you train? Most research and ChatGPT suggest it's better to train on Devanagari, but the multilingual base model of even Whisper 3 is so bad that I am not optimistic anymore. Preparing the data is going to consume so much time that I need to be sure it will be worth it, LOL.