Tokenizer not created when training whisper-small model


I followed the tutorial to train a whisper-small (in my case I used the whisper-base-en) model and I was able to successfully train the model.

However, after pushing the model to my Hub repository and trying to load it with the pipeline function:

pipe = pipeline("automatic-speech-recognition", model="beeezeee/whisper-base")

I get the following error:

OSError: Can’t load tokenizer for ‘beeezeee/whisper-base’. If you were trying to load it from ‘https://huggingface.co/models’, make sure you don’t have a local directory with the same name. Otherwise, make sure ‘beeezeee/whisper-base’ is the correct path to a directory containing all relevant files for a WhisperTokenizerFast tokenizer.

I am not sure what the issue here is, but it seems like my trainer never created a Tokenizer file (but from what I read, ASR is different from your regular NLP models).

@sanchit-gandhi - I feel like I have seen your name quite often in this space on this website (I followed your tutorial as well and I got the same results - no Tokenizer from the training was created).

Here is the list of files my trainer produces:

Let me know if I can provide more information. Thanks.

It seems the training never created the vocab.json, merges.txt and pytorch_model.bin files. Any idea why the training didn’t create those files?
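To make the check above concrete, here is a minimal sketch of how one might compare a file listing against the tokenizer files I expected to see. The helper name missing_tokenizer_files, the list of expected file names, and the example listing are all assumptions for illustration; the listing is hypothetical and is not my trainer's actual output.

```python
from typing import Iterable, List

# Assumption: file names a Whisper tokenizer is typically saved with.
# The exact set can differ between transformers versions.
EXPECTED_TOKENIZER_FILES = ("vocab.json", "merges.txt", "tokenizer_config.json")


def missing_tokenizer_files(present: Iterable[str]) -> List[str]:
    """Return the expected tokenizer files that are absent from `present`."""
    present_names = set(present)
    return [name for name in EXPECTED_TOKENIZER_FILES if name not in present_names]


# Hypothetical listing, for illustration only (not the real repository contents).
example_listing = ["config.json", "preprocessor_config.json", "training_args.bin"]
print(missing_tokenizer_files(example_listing))
```

The listing could come from os.listdir on the trainer's output directory. An empty result would suggest the tokenizer files are present and the problem lies elsewhere.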