Fine-Tune Wav2Vec2 for English ASR with 🤗 Transformers article bug

Hello @patrickvonplaten!
Thank you so much for the tutorials. Super helpful!

I’ve been having trouble setting up a CTC head on a pre-trained model that already has a CTC head, and I wanted to point out a possible problem in the tutorial that partly led to my main problem.

First I’ll pinpoint the problem in the tutorial, and then I’ll describe the problem I’m getting. I’m not that concerned with my own problem yet, as I haven’t dived into the source code to find what’s actually wrong. I just want to make sure I’m not imagining the tutorial issue.

So, first, we set up a vocabulary of 30 tokens and load it into a tokenizer. It is indeed 30 tokens.
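For reference, the vocab construction in the tutorial goes roughly like this (the transcripts below are placeholders, not the dataset the tutorial actually uses, so the final count won’t be 30 here):

```python
import json

# Placeholder transcripts standing in for the tutorial's dataset
texts = ["hello world", "fine tuning is fun"]

# Collect every character that occurs in the transcripts
vocab_list = sorted(set("".join(texts)))
vocab_dict = {c: i for i, c in enumerate(vocab_list)}

# The tutorial swaps the space for "|" as the word delimiter,
# then appends [UNK] and [PAD] at the end of the vocab
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

# This file is what gets loaded into the CTC tokenizer
with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)

print(len(vocab_dict))
```

Note that [PAD] ends up as the last id here, not id 0 — which matters for the blank-token discussion below.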

Then we proceed to load a pre-trained model with a CTC head initialized at random. The thing is, the head is initialized with a vocabulary size of 32! And we never correct it. Are those two extra slots supposed to be bos and eos? Shouldn’t we add those to the vocab?

My goal is to fine-tune a model that already has a fine-tuned CTC head to work for my data (there is no headless model). The head has more output tokens than I need. So I want to initialize it at random.

I prepare my dictionary and a tokenizer; they work just as I expect them to.
I load the model with the CTC head and swap that head for my own. I add vocab_size=len(tokenizer) and pad_token_id to the model config. I even add bos_token_id and eos_token_id (and add those tokens to the vocabulary, although I don’t need them).
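The head swap I’m describing looks roughly like this (a sketch, not my script verbatim; `tokenizer` stands for my new CTC tokenizer and the checkpoint name is just an example):

```python
import torch.nn as nn
from transformers import Wav2Vec2ForCTC

# Example checkpoint that already carries a fine-tuned CTC head
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

# Replace the existing CTC head with a randomly initialized one
# sized for my own vocabulary
model.lm_head = nn.Linear(model.config.hidden_size, len(tokenizer))

# Keep the config in sync with the new head
model.config.vocab_size = len(tokenizer)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```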

And then several strange things tend to happen:

  1. I was going nuts trying to figure out why the same random character kept appearing between the characters that should actually be recognized (I was printing out evaluation-set predictions during training).
    Then I realized that this character has id 0 in the vocab I am creating, and id 0 is usually the PAD token in wav2vec vocabs.

  2. If I set the vocab size in the model config exactly to my vocab size, I get
    raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")
    ValueError: Label values must be <= vocab_size: 32
    so I am forced to set model.config.vocab_size = len(tokenizer) + 1 for things to work.
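The check that produces that error is, as far as I can tell, essentially this (a pure-Python paraphrase of my reading of the modeling code, not the library source verbatim):

```python
def check_labels(labels, vocab_size):
    """Paraphrase of the label validation: every label id must be
    strictly less than vocab_size, i.e. a valid class index for the head."""
    if max(labels) >= vocab_size:
        raise ValueError(f"Label values must be <= vocab_size: {vocab_size}")

# Label ids 0..vocab_size-1 pass fine:
check_labels([0, 5, 31], vocab_size=32)

# A single stray label id equal to vocab_size (e.g. a special char added
# to the tokenizer after the config was set) triggers the error, which is
# what forces the vocab_size + 1 workaround:
try:
    check_labels([0, 5, 32], vocab_size=32)
except ValueError as e:
    print(e)
```

So needing `len(tokenizer) + 1` suggests some label id equal to `len(tokenizer)` is sneaking into the batches.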

So if you know right away what is going wrong, I’d appreciate the answer :upside_down_face:
If not, I’ll just dive deeper myself.



Well, the label error turned out to be my own mistake when adding a special char to the vocabulary…
But the token with id zero showing up everywhere is still under investigation :nerd_face:

I think I found it. The problem comes from torch.nn.ctc_loss: its blank index defaults to zero, so whatever token sits at id 0 in your vocab plays the role of the CTC blank.
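To illustrate why the blank index matters: a greedy CTC decoder collapses repeated ids and then drops the blank id. If id 0 in your vocab is a real character rather than the blank/pad, every frame the model emits as blank decodes as that character sprinkled between the real ones. A minimal sketch (plain Python, not the transformers decoder):

```python
def ctc_greedy_decode(ids, blank=0):
    """Collapse consecutive repeats, then drop the blank id."""
    decoded = []
    prev = None
    for i in ids:
        if i != prev and i != blank:
            decoded.append(i)
        prev = i
    return decoded

# Hypothetical per-frame argmax ids from the model
frames = [0, 7, 7, 0, 0, 3, 0, 9, 9]

# With blank at id 0 (what ctc_loss assumes), blanks vanish:
print(ctc_greedy_decode(frames, blank=0))  # → [7, 3, 9]

# If the decoder thinks blank lives elsewhere, id 0 surfaces as a
# "random character" between every real character:
print(ctc_greedy_decode(frames, blank=5))  # → [0, 7, 0, 3, 0, 9]
```

That second output is exactly the symptom from point 1 above: the token holding id 0 keeps appearing between the correctly recognized characters.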

Do you mean that the CTC always predicts ‘blank’? I’m hitting the same issue. What do you mean by “the problem comes from torch.nn.ctc_loss. It sets blank to zero”?