BERTology compute_heads_importance without zero grad

Hello,
While analyzing the code at https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py, I noticed that the head_importance variable is incremented by the head_mask's gradients. However, these gradients are never zeroed between iterations. In PyTorch, gradients accumulate by default, and the usual recommendation is to zero them after each iteration. Wouldn't it be appropriate to call head_mask.grad.data.zero_() at line 110?
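
To illustrate what I mean, here is a minimal sketch of the accumulation pattern as I understand it (the loss, names, and shapes are illustrative stand-ins, not the exact ones from run_bertology.py):

```python
import torch

# Toy stand-in for the accumulation loop in compute_heads_importance.
n_layers, n_heads = 2, 4
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
head_importance = torch.zeros(n_layers, n_heads)

for step in range(3):
    # Stand-in for the model forward pass: any loss that depends on head_mask.
    loss = (head_mask * torch.randn(n_layers, n_heads)).sum()
    loss.backward()
    head_importance += head_mask.grad.abs().detach()
    # Without the line below, head_mask.grad keeps the running sum of all
    # previous iterations' gradients, so each new iteration adds stale
    # gradients into head_importance as well:
    head_mask.grad.data.zero_()
```

With the zeroing in place, each iteration's contribution to head_importance reflects only that batch's gradient; without it, earlier batches are counted repeatedly.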

Regards