Alternative Language Modeling Loss Calculation

Hi community, I have a question about how the language modeling loss is implemented for causal language models in the transformers library.

For example, in transformers/src/transformers/models/llama/modeling_llama.py (huggingface/transformers on GitHub), CrossEntropyLoss defaults to reduction='mean', which divides the summed per-token losses by the number of tokens, i.e., it normalizes the language modeling loss by the length of the input sequence.
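To make the point concrete, here is a minimal PyTorch sketch of that loss computation (toy shapes and variable names are mine, not from the library); it checks that the default mean reduction equals the sum of per-token losses divided by the token count:

```python
import torch
import torch.nn as nn

# Toy shapes: batch of 2 sequences, 5 tokens each, vocab of 11 (illustrative only).
logits = torch.randn(2, 5, 11)
labels = torch.randint(0, 11, (2, 5))

# Causal LM shift, as in modeling_llama.py: predict token t+1 from tokens <= t.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

loss_fct = nn.CrossEntropyLoss()  # reduction='mean' by default
mean_loss = loss_fct(shift_logits.view(-1, 11), shift_labels.view(-1))

# The same quantity computed explicitly: sum of per-token NLLs / number of tokens.
# (With real labels, positions set to -100 are skipped via ignore_index, so the
# mean is taken over non-masked tokens only; this toy example has no masking.)
per_token = nn.CrossEntropyLoss(reduction="none")(
    shift_logits.view(-1, 11), shift_labels.view(-1)
)
assert torch.allclose(mean_loss, per_token.sum() / per_token.numel())
```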

To put it in math, we maximize the log-likelihood of an input sequence $x_{1:T}$:

$$\max_\theta \; \log p_\theta(x_{1:T}) \;=\; \max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

i.e., the negative log-likelihood is a sum over token positions, not an average.
So it seems the language modeling loss calls for summing the per-token losses rather than taking their mean? If so, I guess the reason the transformers library takes the mean over tokens is for more stable training (the loss magnitude stays comparable across sequence lengths) and easier parallel computation?
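To illustrate why I think the two choices are closely related: for a fixed sequence length $T$, sum and mean differ only by the constant factor $T$, which the optimizer's learning rate can absorb. A small check (plain PyTorch, toy shapes of my own):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(1, 8, 11)   # 1 sequence, 8 tokens, vocab of 11
labels = torch.randint(0, 11, (1, 8))

flat_logits = logits.view(-1, 11)
flat_labels = labels.view(-1)

sum_loss = nn.CrossEntropyLoss(reduction="sum")(flat_logits, flat_labels)
mean_loss = nn.CrossEntropyLoss(reduction="mean")(flat_logits, flat_labels)

# With no ignored labels, sum = T * mean, so the two losses (and their
# gradients) differ only by a constant scale factor of T.
T = flat_labels.numel()
assert torch.allclose(sum_loss, mean_loss * T)
```

Of course, with variable-length sequences in a batch the factor is no longer constant, which is where the two reductions genuinely diverge.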
