How is the loss computed for sequence-to-sequence models?

If I understand it correctly, sequence-to-sequence models use cross-entropy as the loss function. Here, each output token is compared with the true token.

Now, with text summarization for example, does this mean that during training the model is forced to generate a summary of exactly the same length as the reference summary associated with the input? If not, how is the loss computed when the length of the generated summary differs from the length of the true summary?


More or less, yes: the model is scored on an output that’s the same length as the labels. The thing is, though, that during training your model doesn’t actually “generate” anything; it runs a single forward call. You feed the labels into the model all at once, the model outputs a set of logits for every token position, and the loss is computed per token. For example, you can see this here in the T5 code. Note that it’s not calling the generate method or running a decoding loop or anything.
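Here’s a minimal sketch of what that looks like with the Hugging Face transformers API (the "t5-small" checkpoint and the example texts are just placeholders I picked for illustration):

```python
import torch.nn.functional as F
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("summarize: The quick brown fox jumped over the lazy dog.",
                   return_tensors="pt")
labels = tokenizer("A fox jumped over a dog.", return_tensors="pt").input_ids

# Single forward pass: no generate(), no decoding loop.
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)

# outputs.logits has shape (batch, label_length, vocab_size):
# one distribution per label position, so the loss is averaged over
# exactly as many positions as the reference summary has tokens.
loss = outputs.loss

# Equivalent token-level cross-entropy computed by hand
# (ignore_index=-100 is how padded label positions are masked out).
manual_loss = F.cross_entropy(
    outputs.logits.view(-1, outputs.logits.size(-1)),
    labels.view(-1),
    ignore_index=-100,
)
```

So the “generated length vs. reference length” mismatch never comes up during training; the logits are always aligned one-to-one with the label tokens.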

Clear reply. Thank you!

Btw, if you want more background info on how it works, you can also look up “teacher forcing” or “maximum likelihood estimation with teacher forcing”.
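In case it helps, here’s a toy sketch of the teacher-forcing idea itself (all names, sizes, and token ids below are made up for illustration):

```python
import torch
import torch.nn.functional as F

vocab_size, start_id = 10, 1
labels = torch.tensor([[4, 7, 2, 5]])  # gold target tokens

# Teacher forcing: the decoder input is the gold sequence shifted right
# by one position, so at step t the model conditions on the true tokens
# 0..t-1 instead of on its own previous predictions.
decoder_input_ids = torch.cat(
    [torch.full((1, 1), start_id), labels[:, :-1]], dim=1
)

# Pretend these logits came from a single forward pass:
# shape (batch, label_length, vocab_size).
logits = torch.randn(1, labels.size(1), vocab_size)

# Token-level cross-entropy between each position's predicted
# distribution and the corresponding gold token.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
```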


@dblakely To add to your comment: I found this lecture very useful for understanding how cross-entropy for seq2seq models is calculated: Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 6 - Simple and LSTM RNNs - YouTube