When fine-tuning dialogue models (e.g. Alpaca, Vicuna), the common way to compute the loss is to sum the cross-entropy loss over all tokens in each sequence and divide by the sequence length (similar to how per-token perplexity is computed). The final loss is then the average of these per-sequence losses.

Is it necessary to divide by the sequence length here? Under maximum-likelihood estimation, my understanding is that the per-token losses should be summed directly without dividing by the sequence length (so that each sequence's loss equals its negative log-probability), and the total loss is then obtained by averaging over the sequences.
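
To make the two options concrete, here is a minimal sketch of both normalizations as I understand them. The per-token log-probabilities are made-up numbers and the function names are hypothetical, not taken from any particular training script.

```python
# Made-up per-token log-probabilities log p(token | previous tokens)
# for a batch of two sequences of different lengths.
token_logprobs = [
    [-0.5, -1.0, -1.5],            # 3 tokens, sequence log-prob = -3.0
    [-0.25, -0.25, -0.25, -0.25],  # 4 tokens, sequence log-prob = -1.0
]


def length_normalized_loss(batch):
    """Per-sequence loss = summed token NLL / sequence length, then batch mean."""
    per_seq = [-sum(seq) / len(seq) for seq in batch]
    return sum(per_seq) / len(per_seq)


def sequence_nll_loss(batch):
    """Per-sequence loss = summed token NLL = -log p(sequence), then batch mean."""
    per_seq = [-sum(seq) for seq in batch]
    return sum(per_seq) / len(per_seq)


print(length_normalized_loss(token_logprobs))  # 0.625
print(sequence_nll_loss(token_logprobs))       # 2.0
```

With the second version a longer sequence contributes more to the total loss simply because it has more tokens, while the first version gives every sequence the same weight regardless of length.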

Another question: fine-tuning a dialogue model actually models the conditional probability of the answer given the instruction. Does conditional maximum likelihood need any special treatment here?
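
If the conditional likelihood is the objective, the loss would cover only the answer tokens, with the instruction tokens serving purely as context. Below is a minimal sketch of what that could look like, assuming the instruction occupies the first `prompt_len` positions and that `token_logprobs[i]` is the log-probability of the token at position `i` given everything before it. The helper names are hypothetical; the value `-100` mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # same value as the default ignore_index in torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_len):
    """Build labels from input_ids, marking the instruction positions as ignored."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def answer_only_nll(token_logprobs, labels):
    """Sum the negative log-probabilities over positions that are not ignored."""
    return -sum(
        lp for lp, label in zip(token_logprobs, labels) if label != IGNORE_INDEX
    )


# Made-up example: 3 instruction tokens followed by 2 answer tokens.
input_ids = [11, 12, 13, 21, 22]
token_logprobs = [-2.0, -1.0, -0.5, -0.25, -0.75]

labels = mask_prompt_labels(input_ids, prompt_len=3)
print(labels)                                   # [-100, -100, -100, 21, 22]
print(answer_only_nll(token_logprobs, labels))  # 1.0
```

The result is the negative log of p(answer | instruction) under these assumptions; whether to additionally divide it by the number of answer tokens is the same normalization question as above.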