RuntimeError: blank must be in label range

Hi, I’m trying to train a Tamil model. I ran the code as explained in Patrick’s video, but I ran into this error. Can you help me figure out what the reason for it is?

This is my Colab notebook.

Here is a shareable link to the notebook:

The Colab seems to work fine for me; it trains when I run it:

You should set the `vocab_size` in `Wav2Vec2ForCTC.from_pretrained()`.
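To make the relationship concrete, here is a minimal, self-contained sketch of why `vocab_size` matters for this error (the `vocab.json` filename and the toy vocab are illustrative; your notebook builds the real one from the dataset):

```python
import json

# Toy character vocab for illustration only (your notebook builds the real one).
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "அ": 3, "க": 4}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

# The vocab_size passed to Wav2Vec2ForCTC.from_pretrained should equal len(vocab),
# and the CTC blank token id (pad_token_id) must be < vocab_size; otherwise CTC
# raises "blank must be in label range".
vocab_size = len(vocab)
pad_token_id = vocab["[PAD]"]
assert pad_token_id < vocab_size
```

If the model head is smaller than the id of the blank token, the CTC loss cannot index it, which is exactly the error in the title.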

This also happens if the token you have selected is part of the language's vocabulary. In Hindi (and other Devanagari scripts) the pipe "|" is used instead of a full stop, so be careful to select a token that is not part of the normal language vocabulary.
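A quick way to guard against this is to check the transcripts before building the vocab. A small sketch (variable names and the fallback token are illustrative, not from the notebook):

```python
# Sketch: before building a CTC vocab, verify the chosen word-delimiter token
# does not already occur in the transcripts. In Devanagari text, "|" can appear
# as a stand-in for the danda full stop, so it would collide with the delimiter.
transcripts = ["नमस्ते दुनिया|", "एक और पंक्ति"]  # toy lines; "|" used as punctuation

chars = set("".join(transcripts))

delimiter = "|"
if delimiter in chars:
    delimiter = "_"  # fall back to a token that is not in the text
assert delimiter not in chars

# Character-level vocab; the delimiter token stands in for the space.
vocab = {c: i for i, c in enumerate(sorted(chars - {" "}))}
vocab[delimiter] = len(vocab)
```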

@patrickvonplaten Hi, I also have this issue, and it seems related to the new vocab size being larger than the vocab size used during pretraining. I suppose the model is reusing the weights of the pretrained model (the `lm_head` layer). Is there a simple way to update the dimension (similar to `model.resize_token_embeddings` for language models), or does it have to be done manually, e.g. `model.lm_head = nn.Linear(…)`?
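For the manual route mentioned above, a hedged sketch (the `hidden_size` and `new_vocab_size` values are assumptions, e.g. `model.config.hidden_size` and `len(processor.tokenizer)`; the recommended fix in this thread remains passing `vocab_size` to `from_pretrained`):

```python
import torch.nn as nn

# Illustrative values; in practice read them from model.config and the tokenizer.
hidden_size = 1024       # e.g. model.config.hidden_size for wav2vec2-large
new_vocab_size = 70      # e.g. len(processor.tokenizer)

# Build a freshly initialized CTC head with the new output dimension.
lm_head = nn.Linear(hidden_size, new_vocab_size)

# Then, on a loaded model:
# model.lm_head = lm_head
# model.config.vocab_size = new_vocab_size
```

Note the replacement head is randomly initialized, so it still needs to be trained during fine-tuning.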

For posterity, the vocab size is set there in the last parameter and should be the number of distinct characters. It is correct in the original notebook; I don't know if it was corrected here.

model = Wav2Vec2ForCTC.from_pretrained(