Encoder-Decoder Loss

Padding tokens in the labels should be replaced by -100 so that the cross-entropy loss ignores them when computing the loss: -100 is the default ignore_index of PyTorch's CrossEntropyLoss.
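
A minimal sketch of that masking step, assuming the labels come from a Hugging Face tokenizer (the model name and example texts below are just placeholders for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any tokenizer works; placeholder choice
target_texts = ["a short target", "a slightly longer target sentence"]

labels = tokenizer(target_texts, padding=True, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # positions set to -100 are skipped by the loss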

The loss itself is computed like this (note the shift, so that each position predicts the next token):

from torch.nn import CrossEntropyLoss

# prediction_scores are the decoder logits, shape (batch, seq_len, vocab_size)
shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()  # drop the last position
labels = labels[:, 1:].contiguous()  # drop the first token so labels line up with next-token predictions
loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100, so padded label positions are skipped
lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
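
Put together, a self-contained toy version of the same computation might look like this (the shapes and values are made up; prediction_scores stands in for the decoder's logits):

import torch
from torch.nn import CrossEntropyLoss

batch_size, seq_len, vocab_size = 2, 6, 32  # toy dimensions
prediction_scores = torch.randn(batch_size, seq_len, vocab_size)  # stand-in for the decoder logits
labels = torch.randint(0, vocab_size, (batch_size, seq_len))
labels[:, -2:] = -100  # pretend the last two positions are padding

shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
shifted_labels = labels[:, 1:].contiguous()
loss_fct = CrossEntropyLoss()  # ignore_index=-100 by default
lm_loss = loss_fct(shifted_prediction_scores.view(-1, vocab_size), shifted_labels.view(-1))
print(lm_loss)  # scalar loss averaged over the non-padding positions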