Encoder Decoder Loss

Hey, sorry for not replying earlier. The basic reason is that when the tokenizer encodes the target, it produces something like "<START> My decoded sentence <END>". The decoder transformer's output only has to predict "My decoded sentence <END>".
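As a rough sketch (the actual tokens depend on your tokenizer; the sentence and token names here are just illustrative):

```python
# What the decoder receives vs. what it is trained to predict (hypothetical tokens)
decoder_input   = ["<START>", "My", "decoded", "sentence"]   # fed into the decoder
expected_output = ["My", "decoded", "sentence", "<END>"]     # target at each position
```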

So the logits predict the tokens shifted by one (i.e. everything except the <START> token). And the reason we take all logits except the last one is that whatever it predicts after <END> is nonsensical, so we simply drop it.
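Here's a minimal sketch of that shift in the loss computation, assuming PyTorch and using random tensors in place of real model outputs and token ids:

```python
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 100, 2, 6

# Stand-ins for the decoder output and the tokenized "<START> ... <END>" sequence
logits = torch.randn(batch, seq_len, vocab_size)             # one prediction per input position
input_ids = torch.randint(0, vocab_size, (batch, seq_len))   # includes <START> and <END>

# Shift so that the logits at position i are scored against the token at position i + 1
shift_logits = logits[:, :-1, :]   # drop the last prediction: it comes after <END> and is nonsensical
shift_labels = input_ids[:, 1:]    # drop <START>: it is never a prediction target

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
```

The two slices are what "shifted by one" means in practice: every position predicts the next token, and the leftover prediction at the very end has no label to compare against.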
