Hi!
You are right: to compute the loss for autoregressive models, we have to shift the labels by one so that they line up with the generated tokens. In HuggingFace models we use the second approach. In other words:
pred_logits = pred_logits[:, :-1] # all predictions except for the last token
labels = labels[:, 1:] # all labels except for the first one
For example, if the inputs and labels are both [1, 2, 3, 4, 5], a perfect model will generate [2, 3, 4, 5, 6]. We drop the prediction for the new token 6 and the first label 1, so the remaining predictions [2, 3, 4, 5] are compared against the remaining labels [2, 3, 4, 5].
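To make the shift concrete, here is a minimal pure-Python sketch of the same idea for a single sequence (no batch dimension, so the slices are `[:-1]` and `[1:]` instead of `[:, :-1]` and `[:, 1:]`). The function name `shifted_cross_entropy`, the vocabulary size of 7, and the "perfect model" logits are made up for illustration; this is not the actual HuggingFace implementation, which works on tensors.

```python
import math


def shifted_cross_entropy(logits, labels):
    """Mean next-token cross-entropy for one sequence (illustrative helper).

    logits: list of per-position score lists, logits[t][v]
    labels: list of token ids, same length as logits
    """
    # Drop the last prediction (there is no label for it) and the first
    # label (nothing predicts it), so position t is scored against token t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    total = 0.0
    for scores, target in zip(shift_logits, shift_labels):
        # Numerically stable negative log-softmax of the target id.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(shift_labels)


# Toy setup matching the example above: inputs and labels are the same ids.
labels = [1, 2, 3, 4, 5]
vocab_size = 7  # assumed, just large enough to contain id 6

# Hypothetical "perfect" model: at each position, a high score on the next id.
logits = [
    [10.0 if v == tok + 1 else 0.0 for v in range(vocab_size)]
    for tok in labels
]

# Greedy prediction at each position (argmax over the vocabulary).
predicted = [max(range(vocab_size), key=row.__getitem__) for row in logits]

loss = shifted_cross_entropy(logits, labels)
print(predicted)       # greedy predictions for every position
print(predicted[:-1])  # predictions that actually enter the loss
print(labels[1:])      # labels they are compared against
print(loss)
```

With the shift, the loss of this perfect model is close to zero. Without it, every position would be scored against its own input token rather than the next one, and the same model would get a large loss.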