If I understand it correctly, sequence-to-sequence models use cross-entropy as the loss function, where each output token is compared with the corresponding true token.
Now, with text summarization for example, does this mean that during training the model is forced to generate a summary of exactly the same length as the true summary associated with the input? If not, how is the loss computed when the length of the generated summary differs from the length of the true summary?
More or less, yes: the model is trained to produce an output of the same length as the label sequence. The key point, though, is that during training the model doesn't actually "generate" anything; it runs just a single forward call. You feed the labels into the model all at once, the model outputs a set of logits for every position, and the loss is computed per token. You can see this here in the T5 code, for example. Note that it's not calling the generate method or running a decoding loop or anything like that.
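To make the per-token loss concrete, here's a minimal sketch in plain Python (no framework, and the function name `token_cross_entropy` is just for illustration). It mimics what frameworks like PyTorch do with an `ignore_index` of -100 for padded label positions: each position's logits are compared against the true token via cross-entropy, padded positions are skipped, and the result is averaged.

```python
import math

def token_cross_entropy(logits, labels, ignore_index=-100):
    """Per-token cross-entropy, averaged over non-ignored positions.

    logits: one list of vocab-sized scores per target position
    labels: the true token id for each position (ignore_index marks padding)
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded positions contribute nothing to the loss
        # cross-entropy = log-sum-exp of all scores minus the true token's score
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count

# Toy example: vocab of 4 tokens, target sequence of length 3.
# All three positions come out of a single forward pass over the labels.
logits = [
    [2.0, 0.1, 0.1, 0.1],  # model favors token 0
    [0.1, 2.0, 0.1, 0.1],  # model favors token 1
    [0.1, 0.1, 0.1, 0.1],  # uniform (this position is ignored below)
]
labels = [0, 1, -100]      # last position is padding
print(token_cross_entropy(logits, labels))
```

Because the labels themselves are what's fed in, the loss is always defined position-by-position against a target of known length, so the "generated summary is a different length" situation never arises during training.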
Btw, if you want more background on how this works, you can also look up "teacher forcing" or "maximum likelihood estimation with teacher forcing".