RuntimeError: blank must be in label range

Hi, I’m trying to train a Tamil model. I ran the code as explained in Patrick’s video, but I ran into this error. Can you help me understand the reason for it?

This is my colab notebook

Here is a shareable link to the notebook: https://colab.research.google.com/drive/1SSmJywEvx07TtQSRtSFpxRawzb1lnXIC?usp=sharing

The Colab seems to work fine for me; it trains when I run it: https://colab.research.google.com/drive/1NCoaTUx1ntjwO1ZgdvM0tlPFehBTBp7t?usp=sharing

You should set vocab_size in Wav2Vec2ForCTC.from_pretrained().

This also happens if the token you have selected is already part of the language's vocabulary. In Hindi (and other languages written in Devanagari) the pipe "|" is often used in place of a full stop, so be careful to select a token that does not occur in the normal text of the language.
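A quick way to check for this kind of collision before building the vocabulary is sketched below. It assumes the transcripts are available as plain Python strings; `find_collisions` and the sample sentences are hypothetical names and data for illustration, not part of the notebook.

```python
def find_collisions(transcripts, special_tokens):
    """Return the single-character special tokens that also occur in the text."""
    corpus_chars = set("".join(transcripts))
    return sorted(
        tok for tok in special_tokens if len(tok) == 1 and tok in corpus_chars
    )


# Hypothetical sample data: the second sentence ends with a pipe used as a full stop.
transcripts = ["नमस्ते दुनिया", "यह एक वाक्य है|"]
collisions = find_collisions(transcripts, special_tokens=["|", "[UNK]", "[PAD]"])
print(collisions)  # ['|'] : the pipe already occurs in the text
```

If the list is not empty, either pick a different delimiter character or strip the colliding character from the transcripts before building the vocabulary.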

@patrickvonplaten Hi, I also have this issue, and it seems to be related to the new vocab size being larger than the vocab size used during pretraining. I suppose the model is reusing the weights of the pretrained model (the lm_head layer). Is there a simple way to update the dimension (similar to model.resize_token_embeddings for language models), or does it have to be done manually, e.g. model.lm_head = nn.Linear(…)?
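If the head does have to be replaced by hand, a minimal sketch could look like the following. `TinyCTCModel` is a hypothetical stand-in so the snippet runs without downloading a checkpoint; it only assumes, as in `Wav2Vec2ForCTC`, that the output layer is an `nn.Linear` stored in `lm_head`. The sizes are made up.

```python
import torch
from torch import nn


class TinyCTCModel(nn.Module):
    """Hypothetical stand-in for a CTC model with a linear output head."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.encoder = nn.Linear(16, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, features):
        return self.lm_head(torch.tanh(self.encoder(features)))


model = TinyCTCModel(hidden_size=32, vocab_size=32)
new_vocab_size = 50  # assumed size of the new tokenizer vocabulary

# Swap in a freshly initialised head with the same input width.
model.lm_head = nn.Linear(model.lm_head.in_features, new_vocab_size)

logits = model(torch.randn(2, 10, 16))  # (batch, time, features)
print(logits.shape)  # the last dimension now equals new_vocab_size
```

The new layer starts from random weights, so it still has to be trained, and on a real model the `vocab_size` stored in the config would need to be kept in sync with it.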

For posterity: the vocab size is set in the last parameter below and should be the number of different tokens in the tokenizer's vocabulary (the characters plus the special tokens). It is correct in the original notebook; I don’t know whether it was corrected at some point…

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53", 
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True, 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)
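
The error message itself comes from PyTorch's CTC loss, which rejects a blank index that lies outside the range of output classes. The sketch below reproduces that condition by calling `torch.nn.functional.ctc_loss` directly. The shapes and sizes are made-up values for illustration, and `ctc_loss_with_blank` is a hypothetical helper; the assumption is that the model passes `pad_token_id` as the CTC blank, so the output head must have more than `pad_token_id` classes.

```python
import torch
import torch.nn.functional as F


def ctc_loss_with_blank(num_classes, blank_id):
    """Compute CTC loss on random data for a given head size and blank index."""
    time_steps, batch = 20, 2
    log_probs = torch.randn(time_steps, batch, num_classes).log_softmax(dim=-1)
    targets = torch.tensor([[1, 2, 3], [2, 1, 3]])  # label ids, none equal to blank
    input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
    target_lengths = torch.full((batch,), 3, dtype=torch.long)
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=blank_id)


raised = False
try:
    # Head with 32 classes, but the blank (pad) id is 39: out of range.
    ctc_loss_with_blank(num_classes=32, blank_id=39)
except RuntimeError as err:
    raised = True
    print("failed:", err)

# Head with 40 classes covers blank id 39, so the loss can be computed.
loss = ctc_loss_with_blank(num_classes=40, blank_id=39)
print("ok:", loss.item())
```

A cheap pre-flight check along the same lines is to compare `processor.tokenizer.pad_token_id` against the `vocab_size` passed to `from_pretrained` before starting training.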