Alternative Language Modeling Loss Calculation

Hi community, I have a question about how the language modeling loss is implemented for causal language models in the transformers library.

For example, in transformers/src/transformers/models/llama/modeling_llama.py (huggingface/transformers on GitHub), CrossEntropyLoss defaults to reduction='mean', which divides the summed per-token losses by the number of tokens, i.e., it normalizes the language modeling loss by the length of the input sequence.
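To make the point concrete, here is a minimal PyTorch sketch of that loss computation (toy shapes and variable names are mine, not from the library); it checks that the default mean reduction equals the sum of per-token losses divided by the token count:

```python
import torch
import torch.nn as nn

# Toy shapes: batch of 2 sequences, 5 tokens each, vocab of 11 (illustrative only).
logits = torch.randn(2, 5, 11)
labels = torch.randint(0, 11, (2, 5))

# Causal LM shift, as in modeling_llama.py: predict token t+1 from tokens <= t.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

loss_fct = nn.CrossEntropyLoss()  # reduction='mean' by default
mean_loss = loss_fct(shift_logits.view(-1, 11), shift_labels.view(-1))

# The same quantity computed explicitly: sum of per-token NLLs / number of tokens.
# (With real labels, positions set to -100 are skipped via ignore_index, so the
# mean is taken over non-masked tokens only; this toy example has no masking.)
per_token = nn.CrossEntropyLoss(reduction="none")(
    shift_logits.view(-1, 11), shift_labels.view(-1)
)
assert torch.allclose(mean_loss, per_token.sum() / per_token.numel())
```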

To put it in math, we maximize the log-likelihood of an input sequence $x_{1:T}$:

$$\max_\theta \; \log p_\theta(x_{1:T}) \;=\; \max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

i.e., the negative log-likelihood is a sum over token positions, not an average.
So it seems the language modeling loss calls for summing the per-token losses rather than taking their mean? If so, I guess the reason the transformers library takes the mean over tokens is for more stable training (the loss magnitude stays comparable across sequence lengths) and easier parallel computation?
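To illustrate why I think the two choices are closely related: for a fixed sequence length $T$, sum and mean differ only by the constant factor $T$, which the optimizer's learning rate can absorb. A small check (plain PyTorch, toy shapes of my own):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(1, 8, 11)   # 1 sequence, 8 tokens, vocab of 11
labels = torch.randint(0, 11, (1, 8))

flat_logits = logits.view(-1, 11)
flat_labels = labels.view(-1)

sum_loss = nn.CrossEntropyLoss(reduction="sum")(flat_logits, flat_labels)
mean_loss = nn.CrossEntropyLoss(reduction="mean")(flat_logits, flat_labels)

# With no ignored labels, sum = T * mean, so the two losses (and their
# gradients) differ only by a constant scale factor of T.
T = flat_labels.numel()
assert torch.allclose(sum_loss, mean_loss * T)
```

Of course, with variable-length sequences in a batch the factor is no longer constant, which is where the two reductions genuinely diverge.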
