Transformer shifting output question

Hi, I’m trying to improve my understanding of the transformer, and the details surely matter.
I was wondering how HF handles the shifted outputs to satisfy (along with the mask) the auto-regressive property.
Assume the inputs are (x_0, x_1, ..., x_T) and the outputs are (y_0, y_1, ..., y_T). Since we want to train an auto-regressive-like model, we want pred(y_k) = f(x_0, ..., x_T, y_0, ..., y_{k-1}). A simple way to achieve this is to shift the elements to the right by one and, for each token, mask the elements that are now to its right.
I can think of two ways to shift the outputs to the right by one:

  1. (<SHIFT>, y_0, ...,y_{T-1}, y_T)
  2. (<SHIFT>, y_0, ..., y_{T-1})

In my view, the second approach makes more sense. Since y_T is the last output, it won’t be used to generate any token that comes after it, i.e., it never appears on the RHS of the pred function introduced above.
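
For concreteness, here is how the two options look on a toy sequence with T = 3 (just illustrative Python; <SHIFT> is a placeholder for whatever start token is actually used):

```python
# Toy target sequence y_0 .. y_T with T = 3
y = ["y0", "y1", "y2", "y3"]

# Option 1: prepend <SHIFT> and keep y_T -> length T + 2
option_1 = ["<SHIFT>"] + y        # ['<SHIFT>', 'y0', 'y1', 'y2', 'y3']

# Option 2: prepend <SHIFT> and drop y_T -> length T + 1,
# i.e. the same length as the original sequence
option_2 = ["<SHIFT>"] + y[:-1]   # ['<SHIFT>', 'y0', 'y1', 'y2']

print(option_1)
print(option_2)
```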

Which approach does Hugging Face follow?
Thanks

Hi!

You are right: to compute the loss for autoregressive models, we have to shift the labels by one so they match the generated tokens. Hugging Face models use the second approach. In other words:

```python
pred_logits = pred_logits[:, :-1]  # all predictions except for the last token
labels = labels[:, 1:]             # all labels except for the first one
```

So, for example, if the inputs and labels are [1, 2, 3, 4, 5], the model will generate, in the perfect case, [2, 3, 4, 5, 6]. We then get rid of the new token 6 from the predictions and of the first label 1 from the labels.
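
Putting it together, here is a minimal runnable sketch of that shift-then-loss computation (assuming PyTorch and a toy vocab; `logits` stands in for whatever the model actually returns):

```python
import torch
import torch.nn.functional as F

# Toy setup: batch of 1, the sequence [1, 2, 3, 4, 5], vocab size 10.
input_ids = torch.tensor([[1, 2, 3, 4, 5]])  # also used as the labels
logits = torch.randn(1, 5, 10)               # model output: (batch, seq, vocab)

# Shift so that position t is scored against token t + 1.
shift_logits = logits[:, :-1, :]  # drop the prediction made after the last token
shift_labels = input_ids[:, 1:]   # drop the first label

# Cross-entropy over the aligned (prediction, next-token) pairs.
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),  # (batch * (seq - 1), vocab)
    shift_labels.reshape(-1),                         # (batch * (seq - 1),)
)
print(loss)
```

This is the same shifting described above; in practice the modeling code does it internally when you pass labels to a causal LM.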