Transformer shifting output question

Hi, I’m trying to improve my understanding of the transformer, and the details surely matter.
I was wondering how HF handles the shifted outputs to satisfy (along with the mask) the auto-regressive property.
Assume the inputs are (x_0, x_1, ..., x_T) and the outputs are (y_0, y_1, ..., y_T). Since we want to train an auto-regressive-like model, we want pred(y_k) = f(x_0, ..., x_T, y_0, ..., y_{k-1}). A simple way to achieve this is to shift the elements to the right by one and, for each token, mask the elements that are now to its right.
I can think of two ways to shift the outputs to the right by one:

  1. (<SHIFT>, y_0, ...,y_{T-1}, y_T)
  2. (<SHIFT>, y_0, ..., y_{T-1})

In my view, the second approach makes more sense. Since y_T is the last output, it won’t be used to generate any token that comes after it, i.e., it never appears on the RHS of the pred function introduced above.
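
For concreteness, here is how the two options look on a toy sequence with T = 3 (just illustrative Python; <SHIFT> is a placeholder for whatever start token is actually used):

```python
# Toy target sequence y_0 .. y_T with T = 3
y = ["y0", "y1", "y2", "y3"]

# Option 1: prepend <SHIFT> and keep y_T -> length T + 2
option_1 = ["<SHIFT>"] + y        # ['<SHIFT>', 'y0', 'y1', 'y2', 'y3']

# Option 2: prepend <SHIFT> and drop y_T -> length T + 1,
# i.e. the same length as the original sequence
option_2 = ["<SHIFT>"] + y[:-1]   # ['<SHIFT>', 'y0', 'y1', 'y2']

print(option_1)
print(option_2)
```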

Which approach does Hugging Face follow?
Thanks

Hi!

You are right: to compute the loss for autoregressive models, we have to shift the labels by one so they match the generated tokens. Hugging Face models use the second approach. In other words:

```python
pred_logits = pred_logits[:, :-1]  # all predictions except for the last token
labels = labels[:, 1:]             # all labels except for the first one
```

So, for example, if the inputs and labels are [1, 2, 3, 4, 5], the model will generate, in the perfect case, [2, 3, 4, 5, 6]. We then get rid of the new token 6 from the predictions and of the first label 1 from the labels.
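
Putting it together, here is a minimal runnable sketch of that shift-then-loss computation (assuming PyTorch and a toy vocab; `logits` stands in for whatever the model actually returns):

```python
import torch
import torch.nn.functional as F

# Toy setup: batch of 1, the sequence [1, 2, 3, 4, 5], vocab size 10.
input_ids = torch.tensor([[1, 2, 3, 4, 5]])  # also used as the labels
logits = torch.randn(1, 5, 10)               # model output: (batch, seq, vocab)

# Shift so that position t is scored against token t + 1.
shift_logits = logits[:, :-1, :]  # drop the prediction made after the last token
shift_labels = input_ids[:, 1:]   # drop the first label

# Cross-entropy over the aligned (prediction, next-token) pairs.
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),  # (batch * (seq - 1), vocab)
    shift_labels.reshape(-1),                         # (batch * (seq - 1),)
)
print(loss)
```

This is the same shifting described above; in practice the modeling code does it internally when you pass labels to a causal LM.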