Fine-Tune Wav2Vec2 for English ASR with šŸ¤— Transformers article bug

Hello @patrickvonplaten!
Thank you so much for the tutorials. Super helpful!

I've been having trouble setting up a CTC head on a pre-trained model that already has a CTC head, and I wanted to point out a possible problem in the tutorial that partly led to my main problem.

First, I'll pinpoint the problem in the tutorial, and then I'll describe the problem I am running into. I am not too concerned with my own problem yet, as I haven't dived deep into the source code to find what is actually wrong. I just want to make sure I am not misreading the tutorial.

TUTORIAL
So, first, we're setting up a vocabulary of 30 tokens and loading it into a tokenizer. It is indeed 30 tokens.

Then, we proceed to load a pre-trained model with a CTC head initialized at random. The thing is, the head is initialized with a vocabulary size of 32! And we never correct it. Are those extra tokens supposed to be bos and eos? Shouldn't we add those to the vocab?
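
For reference, here is roughly what I would have expected instead: passing vocab_size explicitly so the randomly initialized head matches the tokenizer. This is only a minimal sketch reusing the vocab.json and checkpoint names from the tutorial:

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

# the 30-token vocabulary built earlier in the tutorial
tokenizer = Wav2Vec2CTCTokenizer(
    "./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# pass vocab_size explicitly so the new CTC head has len(tokenizer) outputs
# instead of silently keeping 32
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)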

MY PROBLEM
My goal is to fine-tune a model that already has a fine-tuned CTC head so that it works on my data (there is no headless checkpoint). The head has more output tokens than I need, so I want to reinitialize it at random.

I prepare my dictionary and a tokenizer, and they work just as I expect.
I load the model with the CTC head and swap in my own head. I set vocab_size=len(tokenizer) and pad_token_id in the model config. I even add bos_token_id and eos_token_id (and add those tokens to the vocabulary, although I don't need them). A rough sketch of this setup is shown below.
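
(A minimal sketch of what I mean, assuming the tokenizer from the previous step; the checkpoint name is a placeholder for the fine-tuned model I start from:)

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "some-org/wav2vec2-finetuned",        # hypothetical checkpoint that already has a CTC head
    vocab_size=len(tokenizer),            # my new, smaller vocabulary
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    ignore_mismatched_sizes=True,         # drop the old lm_head weights and re-initialize the head at random
)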

And then several strange things tend to happen:

  1. I was going nuts trying to figure out why the same random character kept appearing between the characters that should actually be recognized (I was printing out evaluation set predictions during training).
    Then I realized that this character has id 0 in the vocab I am creating, and id 0 is usually the id of the PAD token in wav2vec vocabs.

  2. If I set the vocab size in the model config to exactly my vocab size, I get
    raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")
    ValueError: Label values must be <= vocab_size: 32
    so I am forced to set model.config.vocab_size = len(tokenizer) + 1 for training to work.
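    (For context, the check that trips here is essentially that every label id the tokenizer produces must be smaller than the model's vocab_size. A quick sanity check I use, as a sketch; labels and model are assumed to be the encoded transcripts and the loaded model:)

    # every label id must be < model.config.vocab_size, otherwise Wav2Vec2ForCTC
    # raises "Label values must be <= vocab_size"
    max_label_id = max(max(seq) for seq in labels)
    assert max_label_id < model.config.vocab_size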

So if you know right away what is going wrong, I'd appreciate the answer :upside_down_face:
If not, I am just going to dive deeper myself.

Katja

Well, the label error turned out to be my own mistake when adding a special character to the vocabulary…
But the token with id zero showing up everywhere is still under investigation :nerd_face:

I think I found it. The problem comes from torch.nn.ctc_loss: it uses index 0 as the blank token by default.
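
To illustrate: torch's CTC loss takes a blank index that defaults to 0, so whatever token happens to sit at id 0 in the vocabulary is treated as the CTC blank. A minimal, self-contained sketch with random tensors:

import torch
import torch.nn.functional as F

T, N, C = 50, 1, 32                                   # time steps, batch size, classes
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # fake network output
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# blank defaults to 0, so class 0 is reserved for the CTC blank token;
# if your [PAD] token is not at id 0, some other character plays that role
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)  # blank=0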

Do you mean that the CTC always predicts 'blank'? I am running into the same issue. What do you mean by "The problem comes from torch.nn.ctc_loss. It sets blank to zero"?

Hi there, I'm facing the same issue. Did you manage to solve this?

@patrickvonplaten would appreciate your help here…

I just set the pad token to id 0 in the vocab :grimacing:
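
Concretely, something like this when building the vocab (a rough sketch; the character list is just an example standing in for whatever you extracted from your text):

import json

# example character set extracted from the training transcripts (assumption)
vocab_list = sorted(set("abcdefghijklmnopqrstuvwxyz|"))

vocab_dict = {"[PAD]": 0}  # put the pad/blank token at id 0
vocab_dict.update({c: i + 1 for i, c in enumerate(vocab_list)})
vocab_dict["[UNK]"] = len(vocab_dict)

with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)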

Did you also have to set the model's vocab size to the length of the tokenizer? An earlier version I trained was predicting the pad token for everything.

I ran into this issue too. But when I train the model via "from transformers import AutoModelForCTC", it works fine (transformers==4.22.2). This is the link:
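
(Roughly what I do, as a sketch; the checkpoint name is just an example and tokenizer is the one built from my own vocab:)

from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base",              # example checkpoint
    vocab_size=len(tokenizer),             # size of my own vocabulary
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)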

This is because the CTC model can only support a-z, space, [PAD], [UNK], and several other special characters, which in total are 32 characters. If your vocab_dict is larger than 32, the CTC head will not be able to represent the extra characters. Hence, you have to remove some characters from the vocabulary to bring its size down to 32 or fewer.

Hey, I'm currently facing the same problem: the model is predicting only the [PAD] token. How were you able to solve this? Thanks!

Did you find any solution to this problem?

Heyyy, the problem was that the CTC loss expected the pad token to have id 0 in the token vocabulary. Whenever the vocabulary had something else at that position, the loss treated that token as the pad/blank token. Not sure if it's still the case; going to revisit it today.

I don't think this problem exists today. Everything works fine with any id for the pad token in the vocabulary.
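
(As far as I can tell, the model now takes the blank index from config.pad_token_id rather than hard-coding 0, so the main thing to verify is that the config and the tokenizer agree, whatever numeric id [PAD] has. A one-line check, assuming model and tokenizer are already loaded:)

# make sure the model's pad/blank id matches the tokenizer's, whatever its value
assert model.config.pad_token_id == tokenizer.pad_token_id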

I tried this and it worked, thanks.

For people running into this problem because their vocab_size is actually larger than model.config.vocab_size (rather than the pad id not being zero), you can fix it like this:

model.config.update({
    "vocab_size": len(tokenizer),
})