I’m wondering why the outputs of T5 are multiplied by a scalar before being fed into the LM head.
(Link to the original issue: https://github.com/huggingface/transformers/issues/5565)
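To make the question concrete, here is a minimal sketch of the operation I mean. It assumes the scalar is `d_model ** -0.5`, which is my reading of the `T5ForConditionalGeneration` forward pass and is not quoted from the library. The names `rescale_and_project`, `hidden`, and `lm_head_weight` are hypothetical, and plain Python lists stand in for tensors.

```python
def rescale_and_project(hidden, lm_head_weight):
    """Scale each hidden vector by d_model ** -0.5, then project onto the vocabulary.

    hidden: list of decoder output vectors, each of length d_model.
    lm_head_weight: list of vocab_size rows, each of length d_model.
    """
    d_model = len(hidden[0])
    scale = d_model ** -0.5  # assumed scalar, not taken verbatim from the library
    logits = []
    for vec in hidden:
        # Rescale the decoder output before the LM head sees it.
        scaled = [x * scale for x in vec]
        # LM head: one dot product per vocabulary row.
        logits.append([sum(w * x for w, x in zip(row, scaled)) for row in lm_head_weight])
    return logits


# Toy example: d_model = 4, vocab_size = 2
hidden = [[2.0, 4.0, 6.0, 8.0]]
lm_head_weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
logits = rescale_and_project(hidden, lm_head_weight)
```

Since the projection is linear, this is the same as dividing the unscaled logits by the square root of `d_model`.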