Hi, I’m trying to train a Tamil model. I ran the code as explained in Patrick’s video, but I ran into this error. Can you help me understand the reason for it?
This is my colab notebook
Here a shareable link of the notebook: https://colab.research.google.com/drive/1SSmJywEvx07TtQSRtSFpxRawzb1lnXIC?usp=sharing
The colab seems to work fine with me - it’s training when I run it: https://colab.research.google.com/drive/1NCoaTUx1ntjwO1ZgdvM0tlPFehBTBp7t?usp=sharing
You should set the vocab_size in Wav2Vec2ForCTC.from_pretrained().
This also happens if the token you have selected is part of the language’s vocabulary. In Hindi (and other Devanagari scripts) the pipe "|"
is used instead of a full stop. So be careful to select a token that is not part of the normal language vocabulary.
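For what it’s worth, a quick way to guard against this is to check the chosen delimiter against the characters that actually occur in the transcripts before building the vocab. A minimal sketch (the transcript, the delimiter, and the fallback symbol "#" are all illustrative assumptions):

```python
# Hypothetical transcript where "|" appears as sentence punctuation,
# as described for Hindi/Devanagari above.
transcripts = ["यह एक वाक्य है|"]

delimiter = "|"  # the default word_delimiter_token in the notebook
chars = set("".join(transcripts))

if delimiter in chars:
    # "|" is part of the language text itself, so pick another symbol
    # that does not occur in any transcript.
    delimiter = "#"

print(delimiter)  # any symbol absent from the transcripts is fine
```

If you do change the delimiter, remember to pass it as word_delimiter_token when building the Wav2Vec2CTCTokenizer, so the tokenizer and the vocab file stay consistent.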
@patrickvonplaten Hi, I also have this issue, and it seems related to the new vocab size being larger than the vocab size used during pretraining. I suppose the model is reusing the weights of the pretrained model (the lm_head layer). Is there a simple way to update the dimension (similar to model.resize_token_embeddings for language models), or can it only be done manually, e.g. model.lm_head = nn.Linear(…)?
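A minimal sketch of the size mismatch being asked about (the vocab sizes below are hypothetical stand-ins, and the commented-out lines are the manual route, not an official API):

```python
# Hypothetical sizes: the pretraining head was built for 32 tokens,
# while the fine-tuning tokenizer has 64 tokens.
pretrained_vocab_size = 32   # head size baked into the checkpoint
finetune_vocab_size = 64     # stands in for len(processor.tokenizer)

# A mismatched CTC head cannot reuse the pretrained weights; it has to
# be re-created with the new output dimension, roughly:
#
#   import torch.nn as nn
#   model.lm_head = nn.Linear(model.config.hidden_size, finetune_vocab_size)
#   model.config.vocab_size = finetune_vocab_size
#
needs_new_head = finetune_vocab_size != pretrained_vocab_size
print(needs_new_head)
```

The simpler route, as the posts above suggest, is to pass vocab_size=len(processor.tokenizer) directly to Wav2Vec2ForCTC.from_pretrained() so the head is created with the right size from the start.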
For posterity, the vocab size is set there in the last parameter and should be the number of distinct characters. It is correct in the original notebook; I don’t know if it was corrected…
model = Wav2Vec2ForCTC.from_pretrained(
"facebook/wav2vec2-large-xlsr-53",
attention_dropout=0.1,
hidden_dropout=0.1,
feat_proj_dropout=0.0,
mask_time_prob=0.05,
layerdrop=0.1,
gradient_checkpointing=True,
ctc_loss_reduction="mean",
pad_token_id=processor.tokenizer.pad_token_id,
vocab_size=len(processor.tokenizer)
)
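After loading, a one-line sanity check can catch this class of error before training starts. A sketch with hypothetical sizes standing in for the real objects (in the notebook you would compare model.lm_head.out_features with len(processor.tokenizer)):

```python
# Hypothetical stand-ins for the real notebook values:
tokenizer_len = 60          # stands in for len(processor.tokenizer)
lm_head_out_features = 60   # stands in for model.lm_head.out_features

# If these differ, training fails later with a shape/CUDA assertion
# error that is much harder to trace back to the vocab size.
assert lm_head_out_features == tokenizer_len
print("vocab sizes match")
```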
Hi @Shiro,
I encountered the same issue. Did you solve this problem?
Thank you in advance and looking forward to hearing from you.
Best regards