Source and target vs input and labels for causal autoregressive language models

In the *Training a causal language model from scratch* tutorial, the model accepts the same sequence as both inputs and labels, and outputs the logits and loss, for example:

loss, logits = model(batch["input_ids"], labels=batch["input_ids"])

Is it implied that, before the loss is calculated, the target sequence will be derived from the labels by shifting them one token for us? Said another way: in an autoregressive model, the model is supposed to predict the *next* token, so the target should not be the same as the source, but the source with the first token dropped, so that each position lines up with the token one step into the future. This is made clearer in the "custom loss" section of the same tutorial:

from torch.nn import CrossEntropyLoss

def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Calculate per-token loss (reduction="none" keeps one loss per token;
    # the older reduce=False argument is deprecated)
    loss_fct = CrossEntropyLoss(reduction="none")
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
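To make the alignment concrete, here is a dependency-free sketch (made-up token strings, plain lists standing in for tensors) of how the source/target pairs line up after the shift, mirroring the tutorial's `inputs[..., 1:]` and `logits[..., :-1, :]` slicing:

```python
# Toy sequence; the model at each position t should predict the token at t + 1.
tokens = ["The", "cat", "sat", "down", "."]

# The shift: drop the last position from the source side (nothing to predict
# after it) and the first token from the target side (nothing predicts it).
sources = tokens[:-1]   # ["The", "cat", "sat", "down"]
targets = tokens[1:]    # ["cat", "sat", "down", "."]

# Each pair is (context token, token the model must predict next).
pairs = list(zip(sources, targets))
print(pairs)
# → [('The', 'cat'), ('cat', 'sat'), ('sat', 'down'), ('down', '.')]
```

So even though the same sequence is passed in twice, after the shift no position is ever asked to predict itself.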

So I assume this is also being done for me when I pass the same sequence into model(), or into the Trainer's data_collator:

data_collator = lambda data: {
    "input_ids": torch.stack([f[0] for f in data]),
    "attention_mask": torch.stack([f[1] for f in data]),
    "labels": torch.stack([f[0] for f in data]),
}
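As a sanity check on what that lambda builds, here is a dependency-free sketch (plain lists instead of torch tensors, made-up token ids) showing that "labels" is literally the same sequence as "input_ids" when the batch leaves the collator, with no shifting yet:

```python
# Each dataset item is (input_ids, attention_mask), as in the collator above;
# plain lists stand in for torch tensors in this sketch.
data = [
    ([101, 7592, 2088, 102], [1, 1, 1, 1]),
    ([101, 2129, 2024, 102], [1, 1, 1, 1]),
]

collate = lambda batch: {
    "input_ids": [f[0] for f in batch],
    "attention_mask": [f[1] for f in batch],
    "labels": [f[0] for f in batch],  # identical to input_ids on purpose
}

batch = collate(data)
# No shift happens here: the model's forward pass does the shifting later.
assert batch["labels"] == batch["input_ids"]
```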

It's just confusing to me because, the way that is written, it looks as if we are doing an autoencoder-like task by predicting the input itself, but I know from the definition of autoregressive that this can't be what we are doing. Please help me sort out this confusion and point me to where the target sequence gets shifted relative to the labels, if this does indeed happen. Thanks!

Yep, only it's being done for you in the model's forward pass rather than in the data collator! My understanding is that all of the ModelForTaskX classes have default loss functions in their forward pass, which only get used if you include 'labels' in your inputs, and these defaults are what Trainer uses by default. So for example, if you check out the forward pass of the GPTJForCausalLM class, you'll notice the exact same 'shifting' lines as in the custom loss you quoted above:

# from line 846 
loss = None
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
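Here is a dependency-free sketch of what those shifted lines compute, using a toy 4-position sequence over a 3-word vocabulary and a hand-rolled softmax/cross-entropy (the numbers and vocabulary are made up for illustration; the real code does the same thing on tensors):

```python
import math

def softmax(row):
    """Numerically plain softmax over one logits row."""
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 4-token sequence, vocabulary size 3.
lm_logits = [
    [2.0, 0.1, 0.1],  # position 0: should predict labels[1]
    [0.1, 2.0, 0.1],  # position 1: should predict labels[2]
    [0.1, 0.1, 2.0],  # position 2: should predict labels[3]
    [0.3, 0.3, 0.3],  # position 3: predicts past the sequence end, dropped
]
labels = [0, 0, 1, 2]  # the very same ids that went in as input_ids

# The shift from the forward pass: drop the last logits row and the first label.
shift_logits = lm_logits[:-1]
shift_labels = labels[1:]

# Mean cross-entropy over the shifted pairs, like CrossEntropyLoss() does.
losses = [-math.log(softmax(row)[y]) for row, y in zip(shift_logits, shift_labels)]
loss = sum(losses) / len(losses)
```

Note that only 3 loss terms survive for a 4-token sequence, and each logits row is scored against the *following* token, never against the token at its own position.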