Hi all,
I was reading through the encoder-decoder transformers and saw that the model returns a loss, but I'm wondering how it is computed internally.
Is it something like the following: suppose I have the pair ("How are you?", "I am doing great"). In that case, does it compute the cross-entropy loss for each of the four output tokens and then average them?
The EncoderDecoder model calculates the standard auto-regressive cross-entropy loss using the labels, i.e. the output sequence. It just shifts the labels inside the model before computing the loss.
It's the same loss used in other seq2seq models like BART and T5, and in decoder-only models like GPT2.
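As a rough illustration (not the library's actual code), here is a minimal PyTorch sketch of that idea: the labels are shifted right to form the decoder inputs, and the loss is the token-level cross-entropy against the labels, averaged over positions. The token IDs, `vocab_size`, `start_id`, and `end_id` below are all made up for the example.

```python
import torch
import torch.nn.functional as F

vocab_size = 10
start_id, end_id = 0, 1

# Made-up token IDs for the target "I am doing great", followed by <END>.
labels = torch.tensor([[3, 5, 7, 2, end_id]])

# Shift the labels right to build the decoder inputs ("<START> I am doing great"),
# so each decoder position learns to predict the next label token.
decoder_input_ids = torch.cat(
    [torch.full((1, 1), start_id), labels[:, :-1]], dim=-1
)

# Pretend these came out of the decoder: one logit row per decoder input position.
logits = torch.randn(1, decoder_input_ids.size(1), vocab_size)

# Cross-entropy per token against the labels, averaged over all positions --
# i.e. the "per-token loss, then average" described in the question.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss)
```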
Just to extend the question though: I looked at the source code and managed to replicate the loss computed by Hugging Face, but I'm wondering, shouldn't it be:
Hello, I'm not sure whether I should ask this here or create a separate post, but I was looking at the way the loss is computed, and I find it confusing how the logits are shifted and why it's done that way. I've been looking online and haven't found a proper explanation, so could you please explain why and how the logit shifting is done?
Hey, sorry for not replying earlier. The basic reason is that when the tokenizer encodes the target, it produces something like "<START> My decoded sentence <END>". The output of the decoder only needs to predict "My decoded sentence <END>".
So the logits predict the tokens shifted by one (i.e. without the <START> token). And the reason we keep all logits except the last one is that the last position would predict whatever comes after <END>, which is nonsensical, so we simply drop it.
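A small sketch of that shift in plain PyTorch may help (again, toy values, not the library code: the token IDs and `vocab_size` are made up, with 0 standing in for <START> and 1 for <END>):

```python
import torch
import torch.nn.functional as F

vocab_size = 10
# Made-up token IDs for "<START> My decoded sentence <END>".
input_ids = torch.tensor([[0, 4, 6, 8, 1]])                # shape (1, 5)
logits = torch.randn(1, input_ids.size(1), vocab_size)     # one prediction per position

# Position i predicts token i+1, so:
#  - drop the last logit row (it would predict what comes after <END>)
#  - drop the first label (<START> is never a prediction target)
shift_logits = logits[:, :-1, :]   # predictions for "My decoded sentence <END>"
shift_labels = input_ids[:, 1:]    # targets:         "My decoded sentence <END>"

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
print(loss)
```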