BERTology compute_heads_importance without zero grad

Hello,
While analyzing the code at https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py, I noticed that the head_importance variable is incremented by the head_mask's gradients. However, these gradients are never zeroed between iterations. In PyTorch, gradients accumulate by default, and the usual recommendation is to zero them after each iteration. Wouldn't it be appropriate to call head_mask.grad.data.zero_() at line 110?
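
To illustrate what I mean, here is a minimal sketch of the accumulation pattern as I understand it (the loss, names, and shapes are illustrative stand-ins, not the exact ones from run_bertology.py):

```python
import torch

# Toy stand-in for the accumulation loop in compute_heads_importance.
n_layers, n_heads = 2, 4
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
head_importance = torch.zeros(n_layers, n_heads)

for step in range(3):
    # Stand-in for the model forward pass: any loss that depends on head_mask.
    loss = (head_mask * torch.randn(n_layers, n_heads)).sum()
    loss.backward()
    head_importance += head_mask.grad.abs().detach()
    # Without the line below, head_mask.grad keeps the running sum of all
    # previous iterations' gradients, so each new iteration adds stale
    # gradients into head_importance as well:
    head_mask.grad.data.zero_()
```

With the zeroing in place, each iteration's contribution to head_importance reflects only that batch's gradient; without it, earlier batches are counted repeatedly.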

Regards