Finetuning wav2vec2-large-xlsr-53 only outputs blank labels

Hi,

When I try to finetune wav2vec2-large-xlsr-53 on the FSC dataset for ASR using the built-in class Wav2Vec2ForCTC, the CTC loss does not converge, and the system outputs only blank labels, even on training instances.

Here is the log of training (overfitting) with only 8 training instances in total:

Epoch 60/1000, Batch 1/1, Total Step = 60, Loss = 26.296, CER = 100.000, 
Gold: ['SWITCH OFF THE LIGHTS', 'TURN THE VOLUME UP']
Pred: ['', '']
Epoch 1000/1000, Batch 1/1, Total Step = 1000, Loss = 2.663, CER = 100.000
Gold: ['SWITCH OFF THE LIGHTS', 'TURN THE VOLUME UP']
Pred: ['', '']

We can see that even at epoch 60, the CTC loss is ~26 and the system outputs only blank labels for training instances. Continuing training to epoch 1000 reduces the CTC loss, but the system still outputs blanks.
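For anyone wondering why the predictions are literally empty strings: CTC greedy decoding merges consecutive repeated frames and then drops blanks, so a model whose argmax is blank at every frame decodes to nothing. A minimal sketch, with an assumed blank id of 0 and a toy vocabulary:

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop blanks."""
    collapsed = [k for k, _ in groupby(frame_ids)]  # merge repeated frames
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

vocab = {0: "", 1: "H", 2: "I"}          # toy vocabulary, 0 = blank
print(ctc_greedy_decode([1, 1, 0, 2, 2], vocab))  # HI
print(ctc_greedy_decode([0, 0, 0, 0, 0], vocab))  # '' - all blanks
```

So the empty Pred lines above simply mean the argmax is the blank token at every single frame.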

However, if I use an ASR-finetuned model (even one finetuned on Chinese corpora) with exactly the same code, continued finetuning on FSC quickly overfits the training instances. Now the CTC loss is small and the model reproduces the input pretty well:

Epoch 37/1000, Batch 1/1, Total Step = 37, Loss = 0.681, CER = 21.818, 
['CHANGE LANGUAGE', 'TURN THE LIGHTS ON']
['CHANE LANGUAEEI', 'TURN THE LITSH ONT']
Epoch 100/1000, Batch 1/1, Total Step = 100, Loss = 0.028, CER = 2.727, 
['SWITCH OFF THE LIGHTS', 'SWITCH ON THE LIGHTS']
['SWITCH OFF THE LIGHTS', 'SWIITCH ON THE LIGHTSWW']

Here is the code to create new model loading different pretrained models:

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    args.audio_model,
    gradient_checkpointing=True,
    apply_spec_augment=False,
    vocab_size=processor.tokenizer.vocab_size,
    hidden_dropout=0.05,
    activation_dropout=0.05,
    feat_proj_dropout=0.05,
    layerdrop=0.05,
    final_dropout=0.05,
    mask_time_prob=0.05,
    ctc_loss_reduction='mean',
    ctc_zero_infinity=True,
)

I am using Adam with a learning rate of 1e-4. Both models use the same vocabulary of size ~3k (including Chinese characters). This configuration is exactly the same for both pretrained models, yet they behave differently. Note that I also tried sum for ctc_loss_reduction with XLSR but again got only blanks.
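One thing worth double-checking with a custom ~3k vocabulary: in transformers, Wav2Vec2ForCTC uses config.pad_token_id as the CTC blank index, so the tokenizer's pad token must sit at exactly that index. A minimal torch.nn.CTCLoss sketch with assumed toy shapes (2 utterances, 50 frames, a 32-symbol vocabulary with blank at index 0), mirroring the reduction and zero_infinity settings from the config above:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real setup: 50 frames, 2 utterances, 32 labels.
# Index 0 plays the role of the CTC blank (Wav2Vec2ForCTC takes the blank
# index from config.pad_token_id).
T, N, C = 50, 2, 32
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # (time, batch, vocab)
targets = torch.randint(1, C, (N, 10))                # labels 1..C-1, never blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# reduction='mean' and zero_infinity=True mirror the from_pretrained config
ctc = nn.CTCLoss(blank=0, reduction="mean", zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

If the tokenizer's pad token were not at the index the loss treats as blank, the loss would silently treat a real character as blank, which can also manifest as all-blank decodes.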

Could anybody help me on that? Thank you very much! :slight_smile:

I'm facing the same issue. Did you manage to solve it?

Hey @zzuczy and @zhu-y11 - could you maybe post a link to your repos which include:

  • model config
  • training script
  • tokenizer config

on the Hub (https://huggingface.co/) so that I can inspect the files?

Sorry for the late reply, and thanks for your help! For reasons that are hard to explain, it's difficult for me to upload all the resource files to the Hugging Face Hub. This may take some time; I'll do it as soon as possible.
Thanks again for your willingness to help!

The line is here: debug link
The data I use is Aishell-1, which is a Chinese ASR corpus.

The pretrained model is: pretrained model

BTW, it seems that CTC predicting only blanks is a common problem; here are some examples raised by others:
ctc problem
ctc problem
ctc problem

Some answers said that CTC learns to output blanks first and only later moves on to learning the actual characters. I don't know whether this interpretation is true or not.

Perhaps the problem is that I use a huge vocabulary of 8000+ Chinese characters while the training data is only 170 hours, so the model is hard to train?

UPDATE:
I changed the pretrained model from facebook/wav2vec2-base-100k-voxpopuli to facebook/wav2vec2-base. It works; everything is OK. Maybe there is something wrong with facebook/wav2vec2-base-100k-voxpopuli?