GPT-2 shift logits and labels

I am working with GPT-2 and I was looking at the LM head and how it performs the forward pass when labels are provided:

It looks like the logits are shifted right (last value is ignored) and the labels are shifted left (first value is ignored).

Why are the logits and labels shifted in different directions?


The logits are not shifted; only the last position is ignored. The labels are shifted inside the model, as described in the docs ("Note that the labels are shifted inside the model, i.e. you can set labels = input_ids"), because we want to avoid any preprocessing on them and can simply set them equal to the inputs. That way batch creation stays as easy as possible. The downside is that you don't compute the loss on the last token of the sequence, but we found that acceptable since the sequence length is 512.
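To make this concrete, here is a minimal sketch of the shift as it happens inside `GPT2LMHeadModel`'s forward pass (the random tensors here are just stand-ins for the model's actual logits and `input_ids`):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50257, 6

# Stand-ins: in the real model, logits come from the LM head
# and labels are simply set equal to input_ids.
logits = torch.randn(1, seq_len, vocab_size)          # (batch, seq, vocab)
labels = torch.randint(0, vocab_size, (1, seq_len))   # (batch, seq)

# The shift done inside the model:
shift_logits = logits[..., :-1, :].contiguous()  # drop last position's logits
shift_labels = labels[..., 1:].contiguous()      # drop first token as a label

# Position i's logits are now paired with token i+1 as the target.
loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
)
```

After the shift, both tensors cover `seq_len - 1` positions: the logits at position i predict the token at position i+1, so the last logits (which would predict a token beyond the sequence) and the first token (which nothing predicts) are both dropped.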


Thank you for your answer @sgugger. I guess what I wanted to know is why we ignore the last logit value? I think I understand why we ignore the first label.

We have no label for that last logit, which is why it's ignored in the loss computation.
