When fine-tuning dialogue models (e.g., Alpaca, Vicuna), the common loss calculation is to sum the cross-entropy loss over all tokens in each sequence and divide by the sequence length (similar to how per-token perplexity is computed); the final total loss is then the average of the per-sequence losses.
Is dividing by the sequence length necessary here? Under maximum-likelihood estimation, my understanding is that the token losses should be summed directly, without dividing by the sequence length (this equals the negative log-probability of the sequence), and the total loss is then the average of the per-sequence losses.
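To make the difference concrete, here is a minimal PyTorch sketch contrasting the two schemes on a toy batch (the shapes and token ids are illustrative, not from any particular codebase). Method 1 averages per sequence first, so a short sequence weighs as much as a long one; method 2 sums per sequence (the true sequence NLL), so longer sequences contribute more to the batch loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy batch: 2 sequences, max length 5, vocab of 10.
# Padding positions are labeled -100 so cross_entropy ignores them.
vocab = 10
logits = torch.randn(2, 5, vocab)          # (batch, seq_len, vocab)
labels = torch.tensor([
    [1, 2, 3, -100, -100],                 # effective length 3
    [4, 5, 6, 7, 8],                       # effective length 5
])

# Per-token cross-entropy; ignored positions come back as 0.
per_token = F.cross_entropy(
    logits.reshape(-1, vocab), labels.reshape(-1),
    ignore_index=-100, reduction="none",
).reshape(2, 5)
mask = (labels != -100).float()

# Method 1: per-sequence mean, then batch mean (length-normalized).
seq_mean = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
loss_mean_of_means = seq_mean.mean()

# Method 2: per-sequence sum (= sequence negative log-likelihood),
# then batch mean. Each token contributes equally overall.
seq_sum = (per_token * mask).sum(dim=1)
loss_mean_of_sums = seq_sum.mean()
```

Since per-token losses are non-negative and every sequence has length at least 1, method 2's value is always at least as large as method 1's; more importantly, their gradients weight sequences differently, which is exactly the question above.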
A second question: fine-tuning a dialogue model actually maximizes the conditional probability of the answer given the instruction. Does this conditional maximum likelihood require special treatment here, such as excluding the instruction tokens from the loss?
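The usual way to get the conditional likelihood p(answer | instruction) is to keep the instruction tokens in the input (so the model conditions on them) but mask them out of the labels, so only answer tokens incur loss. A minimal sketch, assuming PyTorch's `-100` ignore-index convention (the token ids here are made up for illustration; HF-style models shift labels internally):

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

# Hypothetical token ids: the instruction prompt followed by the answer.
prompt_ids = [101, 2054, 2003]     # instruction tokens (illustrative)
response_ids = [3437, 2003, 102]   # answer tokens (illustrative)

input_ids = torch.tensor(prompt_ids + response_ids)
labels = input_ids.clone()
# Mask the instruction span so only p(answer | instruction) is trained;
# the model still attends to the prompt, it just earns no loss on it.
labels[: len(prompt_ids)] = IGNORE_INDEX
```

With this masking, both averaging schemes from the first question apply only over answer tokens, so the choice of normalization and the conditional-likelihood question interact.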