How is the loss computed for sequence-to-sequence models?

If I understand it correctly, sequence-to-sequence models use cross-entropy as the loss function. Here, each output token is compared with the true token.

Now, with text summarization for example, does this mean that during training the model is forced to generate a summary of the exact same length as the true summary associated with the input? If not, how is the loss computed when the length of the generated summary differs from the length of the true summary?


More or less, yes, it’s forced to produce an output that’s the same length as the labels. The key point, though, is that during training your model doesn’t actually “generate”; it runs just a single forward call. You feed the labels into the model all at once, the model outputs a set of logits for every token position, and the loss is computed per token. For example, you can see this here in the T5 code. Note that it’s not calling the generate method or running a decoding loop or anything.

Clear reply. Thank you!

Btw, if you want more background info on how it works, you can also look up “teacher forcing” or “maximum likelihood estimation with teacher forcing”.
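For anyone wondering what “teacher forcing” means mechanically: the decoder is fed the gold labels shifted one position to the right, so at each step it predicts token t while conditioning on the true tokens before t. A toy sketch (the BOS id here is a made-up placeholder):

```python
BOS = 101  # hypothetical start-of-sequence token id

def shift_right(labels, bos=BOS):
    """Teacher forcing: decoder inputs are the gold labels shifted right."""
    return [bos] + labels[:-1]

# Predict labels[i] while seeing only the true tokens before position i
decoder_inputs = shift_right([7, 8, 9])
```

Because the inputs are the ground-truth tokens rather than the model’s own predictions, all positions can be scored in one forward pass.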


@dblakely To add to your comment: I found this lecture very useful for understanding how cross-entropy for seq2seq models is calculated: Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 6 - Simple and LSTM RNNs - YouTube

In the validation phase, do we also use teacher forcing rather than the autoregressive technique, so that we get logits aligned with the ground truth just like in training? The only difference would be that we don’t update the parameters during validation, right?


You’re right! In sequence-to-sequence models, cross-entropy compares each predicted token with the true one. For text summarization, the model isn’t forced to generate a summary of the exact same length as the true one. It stops generating when it predicts an end-of-sequence token or reaches a max length. If the generated summary is shorter or longer, the loss is calculated by comparing tokens, ignoring padding if used, and stopping at the EOS token.
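The EOS behavior described here applies at generation (inference) time. A toy sketch of that stopping rule, with the model’s next-token prediction faked by a stand-in function (the EOS id and the fake model are assumptions for illustration):

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_next_token(prefix):
    # Stand-in for a model's argmax over logits: emits EOS after 3 tokens
    return EOS if len(prefix) >= 3 else len(prefix) + 1

def generate(max_len=10):
    """Greedy decoding: stop at EOS or when max_len is reached."""
    out = []
    while len(out) < max_len:
        tok = toy_next_token(out)
        if tok == EOS:
            break  # model decided the sequence is finished
        out.append(tok)
    return out

summary = generate()
```

So the generated length is whatever the model chooses, bounded only by max_len; it isn’t tied to the reference length at inference time.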


Are you suggesting we use teacher forcing in the validation phase to compute the cross-entropy loss, not the autoregressive technique? If yes, then the question also arises: why not the autoregressive technique, since validation should mimic the test phase?
Thanks

No, sequence-to-sequence models do not require the generated summary to match the length of the true summary. During training, the loss is computed per token, with shorter sequences padded to match the length of the longer ones in the batch and the padded positions masked out of the loss.
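A tiny illustration of that padding step (the PAD id of -100 follows a common convention for marking positions the loss should ignore; treat the specific value as an assumption here):

```python
PAD = -100  # positions with this id are masked out of the loss

def pad_batch(label_seqs):
    """Pad every label sequence to the batch's max length with PAD."""
    max_len = max(len(s) for s in label_seqs)
    return [s + [PAD] * (max_len - len(s)) for s in label_seqs]

# Two references of different lengths become one rectangular batch
batch = pad_batch([[5, 6], [7, 8, 9]])
```

The padded slots are carried through the forward pass but contribute nothing to the loss, so differing reference lengths are not a problem.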

Since it’s confirmed that we use teacher forcing during training: say we have 9 tokens in the ground truth, then we have raw logits for each token, so we can compute the loss.
In the case of validation, do we use teacher forcing or the autoregressive technique?
Because if we use teacher forcing, we can compute the loss the same way, since we have the logits for each token (input label); the only difference from training is that we don’t update the parameters during validation.
Or do we prefer the autoregressive technique in practice? If yes, then how are we supposed to compute the loss?
Kindly guide us.
Thanks