BERTology compute_heads_importance without zero grad

Analyzing the code, I could see that the `head_importance` variable is incremented with the `head_mask`'s gradients. However, these gradients are never reset to zero between iterations. In PyTorch, gradients accumulate across `backward()` calls, and the usual recommendation is to zero them after each iteration. Wouldn't it be adequate to add a gradient-zeroing call (e.g. `head_mask.grad.zero_()`) at line 110?
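To illustrate the concern, here is a minimal sketch (not the actual `bertology.py` code; `head_mask` here is just a stand-in leaf tensor) showing that PyTorch accumulates gradients on a leaf tensor across `backward()` calls unless they are explicitly zeroed:

```python
import torch

# Stand-in for the head mask: a leaf tensor with requires_grad=True.
head_mask = torch.ones(2, requires_grad=True)

# Three "iterations" without zeroing the gradient in between.
for step in range(3):
    loss = (head_mask * 2.0).sum()
    loss.backward()

# Gradients accumulated: 3 iterations * 2.0 per element.
print(head_mask.grad)  # tensor([6., 6.])

# Zeroing after each use keeps the per-iteration gradient instead:
head_mask.grad.zero_()
loss = (head_mask * 2.0).sum()
loss.backward()
print(head_mask.grad)  # tensor([2., 2.])
```

So without zeroing, each addition to `head_importance` would fold in the gradients of all previous batches as well, not just the current one.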