Hello,
I’m trying to fine-tune HubertForCTC on LibriSpeech 100h with CTC loss. I notice that during training, the .grad values of a) the model parameters (i.e. self.hubert.parameters()) and b) the output layer parameters (self.lm_head.parameters()) are always None (even after several backprop updates), even though requires_grad is True for all of these parameters. More confusingly, the loss decreases normally and the WER improves. Could someone explain why? Unless I am missing something, shouldn’t the .grad value be set after backpropagation?
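To illustrate what I mean, here is a minimal standalone sketch (a toy linear model, not my actual training code) of the .grad behaviour I expect:

import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

# Before any backward pass, .grad is None for every parameter.
assert all(p.grad is None for p in model.parameters())

loss = model(x).sum()
loss.backward()

# After backward, .grad holds the accumulated gradient tensors.
assert all(p.grad is not None for p in model.parameters())

# Caveat: optimizer.zero_grad(set_to_none=True) resets .grad back to None,
# so where in the step you inspect .grad matters.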
I followed the Hugging Face blog post on fine-tuning Wav2Vec2 and adapted it for Hubert. I provide my train.py and my config file here. I call train.py as follows:
model_name="facebook/hubert-base-ls960"
prefix="results/hubert_debug"
config_path="<path_to_config>"
rm -rf ${DIR}/${prefix}
python3 train.py \
--model_name $model_name --save_prefix ${prefix} \
--num_workers 24 --language "en" \
--trainer_config $config_path
Most importantly, this is where I print the .grad of the model parameters (lines 1234-1245 of modeling_hubert.py can be replaced with this snippet to reproduce):
outputs = self.hubert(
    input_values,
    attention_mask=attention_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

# Count the encoder parameters and how many currently have a gradient.
# Note: "if p.grad" would raise for multi-element tensors, so compare against None.
print(f"No. of params: {len(list(self.hubert.parameters()))}")
print(f"No. of params with grad updated: {len([p for p in self.hubert.parameters() if p.grad is not None])}")
print(f"No. of params with requires grad updated: {len([p for p in self.hubert.parameters() if p.requires_grad])}")

hidden_states = outputs[0]
hidden_states = self.dropout(hidden_states)
logits = self.lm_head(hidden_states)

# Same check for the CTC output head.
print(f"No. of params with grad updated in LM Head: {len([p for p in self.lm_head.parameters() if p.grad is not None])}")
print(f"No. of params with requires grad updated in LM Head: {len([p for p in self.lm_head.parameters() if p.requires_grad])}")
And I always get:
No. of params: 211
No. of params with grad updated: 0
No. of params with requires grad updated: 211
No. of params with grad updated in LM Head: 0
No. of params with requires grad updated in LM Head: 2
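In case the location of the check matters: the prints above run inside forward(), i.e. before backward() has run for the current batch. For reference, this is roughly where I would expect .grad to be non-None in a bare-bones loop (model and dataloader below are stand-ins; my actual run uses the HF Trainer):

# Hypothetical manual loop, only to show when .grad should be populated:
# between loss.backward() and optimizer.zero_grad().
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in dataloader:  # assumed to yield input_values and labels
    loss = model(batch["input_values"], labels=batch["labels"]).loss
    loss.backward()
    n_with_grad = sum(p.grad is not None for p in model.parameters())
    print(f"params with grad after backward: {n_with_grad}")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)  # resets .grad to None again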
Can someone please explain this? Any help is appreciated. (cc @patrickvonplaten in case you have any input.)