I recently found that the forward() of newly released LLMs (such as Llama) replaces the explicit CrossEntropyLoss with a call to self.loss_function to compute the next-token prediction loss. However, the forward() of older language models such as GPT-2 remains unchanged.
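For reference, here is a minimal runnable sketch of the two patterns as I understand them. The loss_function stand-in below is only an illustration that mimics the usual shift-and-flatten cross entropy; it is not the actual transformers implementation (which, if I read the source correctly, is resolved on the base PreTrainedModel from config.loss_type):

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, vocab_size = 2, 8, 32
logits = torch.randn(batch, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch, seq_len))

# --- Old pattern (GPT-2 style): loss computed inline in forward() ---
# Shift so that position i predicts token i + 1, then flatten and apply
# an explicitly constructed CrossEntropyLoss.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss_fct = nn.CrossEntropyLoss()
loss_old = loss_fct(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
)

# --- New pattern (Llama style): forward() just delegates, roughly ---
#   loss = self.loss_function(logits=logits, labels=labels,
#                             vocab_size=self.config.vocab_size, **kwargs)
# Hypothetical stand-in with the same shift-and-flatten behavior:
def loss_function(logits, labels, vocab_size, **kwargs):
    shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
    shift_labels = labels[..., 1:].contiguous().view(-1)
    return nn.functional.cross_entropy(shift_logits, shift_labels)

loss_new = loss_function(logits=logits, labels=labels, vocab_size=vocab_size)
print(loss_old.item(), loss_new.item())  # identical for this toy input
```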
I wonder what the difference is between the two approaches. Thanks in advance!