Tokenizer not created when training whisper-small model

Hi,

I followed the tutorial to train a whisper-small model (in my case I used whisper-base.en) and was able to train it successfully.

However, after publishing it to my hub and trying to load it through the pipeline function:

pipe = pipeline("automatic-speech-recognition", model="beeezeee/whisper-base")

I get the following error:

OSError: Can't load tokenizer for 'beeezeee/whisper-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'beeezeee/whisper-base' is the correct path to a directory containing all relevant files for a WhisperTokenizerFast tokenizer.

I am not sure what the issue is, but it seems like my trainer never created a tokenizer file (though from what I have read, ASR is set up differently from regular NLP models).

@sanchit-gandhi - I feel like I have seen your name quite often in this space on this website (I followed your tutorial as well and got the same result - no tokenizer was created during training).

Here is the list of files my trainer produces:
[screenshot of the trainer's output files]

Let me know if I can provide more information. Thanks.

It seems like the training never created the vocab.json, merges.txt, and pytorch_model.bin files. Any idea why the training didn't create those files?

I have followed this guide and would also like to test how my model performs on some audio. I get the exact same error. Have you found a fix?

This is what helped me.
Although this command is in the Google Colab linked in the blog, I believe the author accidentally skipped it in the blog itself.
Before starting training, we need to save the processor, which is not trainable and hence does not change during training.

processor.save_pretrained(training_args.output_dir)

This will add the necessary tokenizer files, so whether you use the model locally or push it to the Hub, it will have everything the pipeline needs.
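As a minimal sketch of the fix (assuming the base checkpoint openai/whisper-base.en from the question, and using a temporary directory to stand in for training_args.output_dir), saving the processor writes out the tokenizer files that the error complains about:

```python
import os
import tempfile

from transformers import WhisperProcessor

# The checkpoint name here is the one from the question; substitute your own.
processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")

# Stand-in for training_args.output_dir; in the tutorial's script you would
# call this once before trainer.train().
output_dir = tempfile.mkdtemp()
processor.save_pretrained(output_dir)

# The directory should now contain the tokenizer files (tokenizer_config.json,
# vocab.json, ...) alongside the feature-extractor config.
saved_files = sorted(os.listdir(output_dir))
print(saved_files)
```

After this, trainer.push_to_hub() (or a manual upload of the output directory) uploads those files along with the model, and the pipeline can load the repo without the tokenizer error.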