Transformer shifting output question

RaushanTurganbay · May 13, 2024, 8:54am

Hi!

You are right: to compute the loss for autoregressive models, we have to shift the labels by one position so that they line up with the predicted tokens. HuggingFace models use the second approach. In other words:

pred_logits = pred_logits[:, :-1] # all predictions except for the last token
labels = labels[:, 1:] # all labels except for the first one

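So, for example, if the inputs and labels are both [1, 2, 3, 4, 5], the model will in the perfect case predict [2, 3, 4, 5, 6]. We get rid of the prediction for the new token 6 and of the first label 1, so the remaining predictions [2, 3, 4, 5] line up one-to-one with the remaining labels [2, 3, 4, 5].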
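
To make the alignment concrete, here is a minimal sketch in plain Python that reproduces the example above for a single sequence. It is not the library's actual code: the batch dimension is dropped for brevity, and the names `shifted_cross_entropy`, `log_softmax`, and `VOCAB_SIZE`, as well as the hand-built "perfect" logits, are assumptions made up for illustration.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def shifted_cross_entropy(logits, labels):
    """Mean next-token cross-entropy for one sequence (hypothetical helper).

    logits: list of seq_len rows, each with vocab_size scores
    labels: list of seq_len token ids (same ids as the inputs)
    """
    shift_logits = logits[:-1]  # all predictions except for the last token
    shift_labels = labels[1:]   # all labels except for the first one
    losses = [
        -log_softmax(row)[target]
        for row, target in zip(shift_logits, shift_labels)
    ]
    return sum(losses) / len(losses)

VOCAB_SIZE = 7            # assumed toy vocabulary: token ids 0..6
inputs = [1, 2, 3, 4, 5]
labels = list(inputs)     # labels are simply a copy of the inputs

# Hand-built "perfect" logits: at each position the highest score
# goes to the token that follows the current input token.
logits = [
    [10.0 if tok == t + 1 else 0.0 for tok in range(VOCAB_SIZE)]
    for t in inputs
]

predictions = [row.index(max(row)) for row in logits]
loss = shifted_cross_entropy(logits, labels)

print(predictions)       # argmax prediction at every position
print(predictions[:-1])  # predictions kept after the shift
print(labels[1:])        # labels kept after the shift
print(loss)              # close to zero for this idealized model
```

Without the shift, the prediction at position 0 (token 2) would be compared against label 1, so even a perfect model would be penalized at every position.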
