Tokenizer not created when training whisper-small model

Hi,

I followed the tutorial to train a whisper-small model (in my case I used whisper-base.en) and was able to train it successfully.

However, after publishing it to my hub and trying to load it through the pipeline function:

pipe = pipeline("automatic-speech-recognition", model="beeezeee/whisper-base")

I get the following error:

OSError: Can't load tokenizer for 'beeezeee/whisper-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'beeezeee/whisper-base' is the correct path to a directory containing all relevant files for a WhisperTokenizerFast tokenizer.

I am not sure what the issue is, but it seems like my trainer never created a tokenizer file (though from what I have read, ASR is set up differently from regular NLP models).

@sanchit-gandhi - I feel like I have seen your name quite often in this space on this website (I followed your tutorial as well and got the same result - no tokenizer was created during training).

Here is the list of files my trainer produces:
[screenshot of the trainer's output files]

Let me know if I can provide more information. Thanks.

It seems like the training never created the vocab.json, merges.txt, and pytorch_model.bin files. Any idea why the training didn't create those files?

I have followed this guide and would also like to test how my model performs on some audio. I get the exact same error. Have you found a fix?

This is what helped me.
Although this command is in the Google Colab linked in the blog, I believe the author accidentally skipped it in the blog itself.
Before starting training, we need to save the processor, which is not trainable and hence does not change during training.

processor.save_pretrained(training_args.output_dir)

This will add the necessary tokenizer files, so whether you use the model locally or push it to the Hub, it will have everything the pipeline needs.
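As a minimal sketch of the fix (assuming the base checkpoint openai/whisper-base.en from the question, and using a temporary directory to stand in for training_args.output_dir), saving the processor writes out the tokenizer files that the error complains about:

```python
import os
import tempfile

from transformers import WhisperProcessor

# The checkpoint name here is the one from the question; substitute your own.
processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")

# Stand-in for training_args.output_dir; in the tutorial's script you would
# call this once before trainer.train().
output_dir = tempfile.mkdtemp()
processor.save_pretrained(output_dir)

# The directory should now contain the tokenizer files (tokenizer_config.json,
# vocab.json, ...) alongside the feature-extractor config.
saved_files = sorted(os.listdir(output_dir))
print(saved_files)
```

After this, trainer.push_to_hub() (or a manual upload of the output directory) uploads those files along with the model, and the pipeline can load the repo without the tokenizer error.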