What is `self.loss_function` in `forward()` of newly released LLMs?

I recently noticed that the forward() of newly released LLMs (such as Llama) replaces the explicit CrossEntropyLoss with a call to self.loss_function when computing the next-token prediction loss. However, the forward() of older language models such as GPT-2 remains unchanged.

What is the difference between the two? Thanks in advance!
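For context on what I expect both code paths to compute: as far as I understand, GPT-2's explicit CrossEntropyLoss and the newer self.loss_function call should both boil down to a shifted next-token cross-entropy (tokens before position n predict the token at position n, with -100 labels ignored). Here is a minimal NumPy sketch of that computation — the function name and shapes are my own for illustration, not from the transformers library:

```python
import numpy as np

def causal_lm_loss(logits, labels, ignore_index=-100):
    """Shifted next-token cross-entropy, as I understand causal LM training.

    logits: float array of shape (batch, seq_len, vocab_size)
    labels: int array of shape (batch, seq_len), ignore_index marks padding
    """
    # Shift so that the logits at position n are scored against token n+1.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.shape[-1])
    shift_labels = labels[:, 1:].reshape(-1)

    # Drop positions whose label is the ignore index.
    mask = shift_labels != ignore_index
    shift_logits = shift_logits[mask]
    shift_labels = shift_labels[mask]

    # Numerically stable log-softmax, then pick out the target log-probs.
    logits_max = shift_logits.max(axis=-1, keepdims=True)
    log_probs = shift_logits - logits_max - np.log(
        np.exp(shift_logits - logits_max).sum(axis=-1, keepdims=True)
    )
    nll = -log_probs[np.arange(len(shift_labels)), shift_labels]
    return nll.mean()

# Uniform logits over a vocab of 4: loss should be ln(4) ≈ 1.3863.
loss = causal_lm_loss(np.zeros((1, 3, 4)), np.array([[1, 2, 3]]))
print(loss)
```

If the two implementations differ only in where this logic lives (inline in forward() vs. a shared, swappable loss attribute), the trained result should be the same, which is the core of my question.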
