Hello @patrickvonplaten!
Thank you so much for the tutorials. Super helpful!
I've been having trouble setting up a CTC head on a pre-trained model that already has a CTC head, and I wanted to point out a possible problem in the tutorial that in part led me to my main problem.
First, I'll pinpoint the issue in the tutorial, and then I'll describe the problem I am getting. I am not that concerned with my problem yet, as I haven't dived deep into the source code to find what is actually wrong. I just want to make sure I am not going crazy about the tutorial part.
TUTORIAL
So, first, we set up a vocabulary of 30 tokens and load it into a tokenizer. It is indeed 30 tokens.
Then we proceed to load a pre-trained model with a randomly initialized CTC head. The thing is, the head is initialized with a vocabulary size of 32! And we never correct it. Are the two extra entries supposed to be bos and eos? Shouldn't we add those to the vocab?
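A minimal sketch of what I mean (the checkpoint name and vocab file are just stand-ins for the ones used in the tutorial):

```python
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

# the tokenizer built from the tutorial's vocab.json
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
print(len(tokenizer))  # 30 -- the vocabulary we just built

# from_pretrained sizes the freshly initialized CTC head from the checkpoint's
# config (vocab_size=32 for this checkpoint), not from our tokenizer
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
print(model.lm_head.out_features)  # 32
print(model.config.vocab_size)     # 32
```

If I understand correctly, passing vocab_size=len(tokenizer) to from_pretrained would make the head match the tokenizer, but the tutorial never does that.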
MY PROBLEM
My goal is to fine-tune a model that already has a fine-tuned CTC head so that it works on my data (there is no headless model available). The existing head has more output tokens than I need, so I want to initialize it at random.
I prepare my dictionary and a tokenizer, and they work just as I expect them to.
I load the model with the CTC head and swap its head for mine. I set vocab_size=len(tokenizer) and pad_token_id in the model config. I even set bos_token_id and eos_token_id (and add those tokens to the vocabulary, although I don't need them).
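Concretely, my setup looks roughly like this (the checkpoint name and vocab.json path are placeholders for my actual ones, and I am assuming the head maps from hidden_size as in the standard Wav2Vec2ForCTC):

```python
import torch
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

# my tokenizer, built from my own vocab.json
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# a checkpoint that already carries a fine-tuned CTC head (placeholder name)
model = Wav2Vec2ForCTC.from_pretrained("some-org/wav2vec2-with-ctc-head")

# swap the existing head for a freshly initialized one sized to my vocab
model.lm_head = torch.nn.Linear(model.config.hidden_size, len(tokenizer))

# update the config to match the new head and tokenizer
model.config.vocab_size = len(tokenizer)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```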
And then several strange things tend to happen:
- I was going nuts over why the same random character kept appearing between the characters that should actually be recognized (I was printing out evaluation-set predictions during training). Then I realized that this character has id 0 in the vocab I am creating, and id 0 is usually the PAD token in wav2vec2 vocabs.
- If I set the vocab size in the model config to exactly my vocab size, I get
  raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")
  ValueError: Label values must be <= vocab_size: 32
  so I am forced to set model.config.vocab_size = len(tokenizer) + 1 for things to work (see the sketch after this list).
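To make the two points above concrete (the vocab contents here are made up; only the structure matters, and the comment about the label check reflects my reading of Wav2Vec2ForCTC.forward):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# (1) in the vocab I build, id 0 belongs to a regular character -- the one that
#     keeps showing up in predictions -- whereas in the stock wav2vec2 vocabs
#     id 0 is the PAD token. Illustrative vocab only:
vocab = {"a": 0, "b": 1, "c": 2, "|": 3, "[UNK]": 4, "[PAD]": 5}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# (2) the ValueError comes from the label check in Wav2Vec2ForCTC.forward, which
#     raises when labels.max() >= config.vocab_size; with
#     config.vocab_size = len(tokenizer) it fires for me, so for now I use:
#         model.config.vocab_size = len(tokenizer) + 1  # workaround
```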
So if you know right away what is going wrong, I'll appreciate the answer.
If not, I am just going to dive deeper myself.
Katja