How is the loss computed for sequence-to-sequence models?

If I understand it correctly, sequence-to-sequence models use cross-entropy as the loss function. Here, each output token is compared with the true token.

Now, with text summarization for example, does this mean that during training the model is forced to generate a summary of the exact same length as the true summary associated with the input? If not, how is the loss computed when the length of the generated summary differs from the length of the true summary?


More or less, yes, it’s forced to produce an output that’s the same length as the labels. The key point, though, is that during training your model doesn’t actually “generate”; it runs just a single forward call. You feed the labels into the model all at once, the model outputs a set of logits for every token position, and the loss is computed per token. For example, you can see this here in the T5 code. Note that it’s not calling the generate method or running a decoding loop or anything.

Clear reply. Thank you!

Btw, if you want more background info on how it works, you can also look up “teacher forcing” or “maximum likelihood estimation with teacher forcing”.
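For anyone wondering what “teacher forcing” means mechanically: the decoder is fed the gold labels shifted one position to the right, so at each step it predicts token t while conditioning on the true tokens before t. A toy sketch (the BOS id here is a made-up placeholder):

```python
BOS = 101  # hypothetical start-of-sequence token id

def shift_right(labels, bos=BOS):
    """Teacher forcing: decoder inputs are the gold labels shifted right."""
    return [bos] + labels[:-1]

# Predict labels[i] while seeing only the true tokens before position i
decoder_inputs = shift_right([7, 8, 9])
```

Because the inputs are the ground-truth tokens rather than the model’s own predictions, all positions can be scored in one forward pass.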


@dblakely To add to your comment: I found this lecture very useful for understanding how cross-entropy for seq2seq models is calculated: Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 6 - Simple and LSTM RNNs - YouTube

In the validation phase, do we also use teacher forcing rather than the autoregressive technique, so that we get logits aligned with the ground truth just like in training? The only difference would be that we don’t update the parameters during validation, right?


You’re right! In sequence-to-sequence models, cross-entropy compares each predicted token with the true one. For text summarization, the model isn’t forced to generate a summary of the exact same length as the true one. It stops generating when it predicts an end-of-sequence token or reaches a max length. If the generated summary is shorter or longer, the loss is calculated by comparing tokens, ignoring padding if used, and stopping at the EOS token.
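The EOS behavior described here applies at generation (inference) time. A toy sketch of that stopping rule, with the model’s next-token prediction faked by a stand-in function (the EOS id and the fake model are assumptions for illustration):

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_next_token(prefix):
    # Stand-in for a model's argmax over logits: emits EOS after 3 tokens
    return EOS if len(prefix) >= 3 else len(prefix) + 1

def generate(max_len=10):
    """Greedy decoding: stop at EOS or when max_len is reached."""
    out = []
    while len(out) < max_len:
        tok = toy_next_token(out)
        if tok == EOS:
            break  # model decided the sequence is finished
        out.append(tok)
    return out

summary = generate()
```

So the generated length is whatever the model chooses, bounded only by max_len; it isn’t tied to the reference length at inference time.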


Are you suggesting we use teacher forcing in the validation phase to compute the cross-entropy loss, not the autoregressive technique? If yes, then the question also arises: why not the autoregressive technique, since validation should mimic the test phase?
Thanks

No, sequence-to-sequence models do not require the generated summary to match the length of the true summary. During training, the loss is computed per token, with shorter sequences padded to match the length of the longer ones in the batch and the padded positions masked out of the loss.
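A tiny illustration of that padding step (the PAD id of -100 follows a common convention for marking positions the loss should ignore; treat the specific value as an assumption here):

```python
PAD = -100  # positions with this id are masked out of the loss

def pad_batch(label_seqs):
    """Pad every label sequence to the batch's max length with PAD."""
    max_len = max(len(s) for s in label_seqs)
    return [s + [PAD] * (max_len - len(s)) for s in label_seqs]

# Two references of different lengths become one rectangular batch
batch = pad_batch([[5, 6], [7, 8, 9]])
```

The padded slots are carried through the forward pass but contribute nothing to the loss, so differing reference lengths are not a problem.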

Since it’s confirmed that we use teacher forcing during training: say we have 9 tokens in the ground truth, then we have raw logits for each token, so we can compute the loss.
In the case of validation, do we use teacher forcing or the autoregressive technique?
Because if we use teacher forcing, we can compute the loss the same way, since we have the logits for each token (input label); the only difference from training is that we don’t update the parameters during validation.
Or do we prefer the autoregressive technique in practice? If yes, then how are we supposed to compute the loss?
Kindly guide us.
Thanks