Hello @patrickvonplaten!
Thank you so much for the tutorials. Super helpful!
I've been having trouble setting up a CTC head on a pre-trained model that already has a CTC head, and I wanted to point out a possible problem in the tutorial that in part led me to my main problem.
First, I'll pinpoint the issue in the tutorial, and then I'll describe the problem I am getting. I am not that concerned with my problem yet, as I haven't dived deep into the source code to find what is actually wrong. I just want to make sure I am not going crazy about the tutorial part.
TUTORIAL
So, first, we set up a vocabulary of 30 tokens and load it into a tokenizer. It is indeed 30 tokens.
Then we proceed to load a pre-trained model with a randomly initialized CTC head. The thing is, the head is initialized with a vocabulary size of 32! And we never correct it. Are the two extra entries supposed to be bos and eos? Shouldn't we add those to the vocab?
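A minimal sketch of what I mean (the checkpoint name and vocab file are just stand-ins for the ones used in the tutorial):

```python
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

# the tokenizer built from the tutorial's vocab.json
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
print(len(tokenizer))  # 30 -- the vocabulary we just built

# from_pretrained sizes the freshly initialized CTC head from the checkpoint's
# config (vocab_size=32 for this checkpoint), not from our tokenizer
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
print(model.lm_head.out_features)  # 32
print(model.config.vocab_size)     # 32
```

If I understand correctly, passing vocab_size=len(tokenizer) to from_pretrained would make the head match the tokenizer, but the tutorial never does that.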
MY PROBLEM
My goal is to fine-tune a model that already has a fine-tuned CTC head so that it works on my data (there is no headless model available). The existing head has more output tokens than I need, so I want to initialize it at random.
I prepare my dictionary and a tokenizer, and they work just as I expect them to.
I load the model with the CTC head and swap its head for mine. I set vocab_size=len(tokenizer) and pad_token_id in the model config. I even set bos_token_id and eos_token_id (and add those tokens to the vocabulary, although I don't need them).
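Concretely, my setup looks roughly like this (the checkpoint name and vocab.json path are placeholders for my actual ones, and I am assuming the head maps from hidden_size as in the standard Wav2Vec2ForCTC):

```python
import torch
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

# my tokenizer, built from my own vocab.json
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# a checkpoint that already carries a fine-tuned CTC head (placeholder name)
model = Wav2Vec2ForCTC.from_pretrained("some-org/wav2vec2-with-ctc-head")

# swap the existing head for a freshly initialized one sized to my vocab
model.lm_head = torch.nn.Linear(model.config.hidden_size, len(tokenizer))

# update the config to match the new head and tokenizer
model.config.vocab_size = len(tokenizer)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```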
And then several strange things tend to happen:
- I was going nuts over why the same random character kept appearing between the characters that should actually be recognized (I was printing out evaluation-set predictions during training). Then I realized that this character has id 0 in the vocab I am creating, and id 0 is usually the PAD token in wav2vec2 vocabs.
- If I set the vocab size in the model config to exactly my vocab size, I get
  raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")
  ValueError: Label values must be <= vocab_size: 32
  so I am forced to set model.config.vocab_size = len(tokenizer) + 1 for things to work (see the sketch after this list).
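To make the two points above concrete (the vocab contents here are made up; only the structure matters, and the comment about the label check reflects my reading of Wav2Vec2ForCTC.forward):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# (1) in the vocab I build, id 0 belongs to a regular character -- the one that
#     keeps showing up in predictions -- whereas in the stock wav2vec2 vocabs
#     id 0 is the PAD token. Illustrative vocab only:
vocab = {"a": 0, "b": 1, "c": 2, "|": 3, "[UNK]": 4, "[PAD]": 5}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# (2) the ValueError comes from the label check in Wav2Vec2ForCTC.forward, which
#     raises when labels.max() >= config.vocab_size; with
#     config.vocab_size = len(tokenizer) it fires for me, so for now I use:
#         model.config.vocab_size = len(tokenizer) + 1  # workaround
```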
So if you know right away what is going wrong, I'll appreciate the answer.
If not, I am just going to dive deeper myself.
Katja