I recently noticed that the forward() of newly released LLMs in transformers (such as Llama) replaces the explicit CrossEntropyLoss with a call to self.loss_function when computing the next-token prediction loss. However, the forward() of older language models such as GPT2 remains unchanged.
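To make the question concrete, here is a simplified sketch of the two patterns I am comparing. This is paraphrased, not the actual library code: the helper name gpt2_style_loss is mine, and the exact signatures depend on the transformers version.

```python
import torch
from torch import nn

# Old style (e.g., GPT2LMHeadModel.forward): the criterion is
# hard-coded inline in forward().
def gpt2_style_loss(lm_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so that tokens < n predict token n.
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = nn.CrossEntropyLoss()
    return loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

# New style (e.g., LlamaForCausalLM.forward): the model delegates to an
# attribute instead of writing the criterion out, roughly:
#     loss = self.loss_function(logits=logits, labels=labels,
#                               vocab_size=self.config.vocab_size, **kwargs)

# Quick check of the old-style helper with random data:
logits = torch.randn(2, 8, 32000)          # (batch, seq_len, vocab_size)
labels = torch.randint(0, 32000, (2, 8))   # token ids used as targets
print(gpt2_style_loss(logits, labels))
```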
I wonder what the difference between the two is. Thanks in advance!