Param .grad is None during training despite requires_grad=True

Hello,

I’m trying to fine-tune HubertForCTC on LibriSpeech 100h with CTC loss. During training I notice that the .grad values of (a) the encoder parameters (self.hubert.parameters()) and (b) the output layer parameters (self.lm_head.parameters()) are always None, even after several backpropagation steps, although requires_grad is True for all of them. More confusingly, the loss decreases normally and the WER improves. Could someone explain why? Unless I am missing something, .grad should be populated after backpropagation, shouldn’t it?
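
For context, this is the behaviour I expect from .grad, shown on a toy model (a minimal sketch, not my actual setup): .grad starts out as None and is only populated once loss.backward() has run.

import torch
import torch.nn as nn

# Toy model: .grad is None after construction ...
model = nn.Linear(4, 2)
print(all(p.grad is None for p in model.parameters()))      # True: no backward() yet

# ... and becomes a tensor once backward() has been called on a loss.
loss = model(torch.randn(3, 4)).sum()
loss.backward()
print(all(p.grad is not None for p in model.parameters()))  # True: grads populated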

I have followed the Hugging Face blog on fine-tuning Wav2Vec2 and adapted it for Hubert. I provide my train.py and my config file here. I call train.py as follows:

model_name="facebook/hubert-base-ls960"
prefix="results/hubert_debug"
config_path="<path_to_config>"
rm -rf ${DIR}/${prefix}

python3 train.py \
    --model_name $model_name --save_prefix ${prefix} \
    --num_workers 24 --language "en" \
    --trainer_config $config_path

Most importantly, this is where I print the .grad of the model parameters (lines 1234-1245 of modeling_hubert.py can be replaced with this snippet for reproduction):

outputs = self.hubert(
    input_values,
    attention_mask=attention_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

# Check the encoder parameters: total count, how many have a populated .grad,
# and how many have requires_grad=True.
hubert_params = list(self.hubert.parameters())
print(f"No. of params: {len(hubert_params)}")
print(f"No. of params with grad updated: {len([p for p in hubert_params if p.grad is not None])}")
print(f"No. of params with requires grad updated: {len([p for p in hubert_params if p.requires_grad])}")

hidden_states = outputs[0]
hidden_states = self.dropout(hidden_states)

logits = self.lm_head(hidden_states)

# Same check for the output (CTC) head.
lm_head_params = list(self.lm_head.parameters())
print(f"No. of params with grad updated in LM Head: {len([p for p in lm_head_params if p.grad is not None])}")
print(f"No. of params with requires grad updated in LM Head: {len([p for p in lm_head_params if p.requires_grad])}")

And I always get:

No. of params: 211
No. of params with grad updated: 0
No. of params with requires grad updated: 211
No. of params with grad updated in LM Head: 0
No. of params with requires grad updated in LM Head: 2
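
For comparison, this is where I would expect .grad to be populated if the check were run in a plain training loop, i.e. right after loss.backward() and before the optimizer clears the gradients. This is only a minimal sketch: model stands for the HubertForCTC instance, and dataloader / optimizer are placeholders (my actual run goes through the Trainer as in the blog).

# Hypothetical manual loop, not my actual Trainer-based setup.
for batch in dataloader:
    outputs = model(
        input_values=batch["input_values"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
    )
    outputs.loss.backward()

    # At this point the grads should exist (before zero_grad() clears them).
    n_with_grad = sum(p.grad is not None for p in model.hubert.parameters())
    print(f"No. of params with grad populated after backward(): {n_with_grad}")

    optimizer.step()
    optimizer.zero_grad()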

Can someone please explain this? Any help is appreciated. (cc @patrickvonplaten in case you have any inputs :slight_smile: )

Useful links (since the previous post unfortunately only allowed me to post 2 links):

  1. The blog I followed: Fine-Tune Wav2Vec2 for English ASR in Hugging Face with 🤗 Transformers

  2. The modeling_hubert.py script where I added the print statements for the .grad checks: https://github.com/huggingface/transformers/blob/b08f41e62a41632195cb986fcc41d428a5bf1d56/src/transformers/models/hubert/modeling_hubert.py#L1234

Hope these help!