Where does Transformers do the target text shifting in causal LM?

mk6 · February 24, 2023, 4:15am

The official tutorial on building a causal LM from scratch says:
Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.

Indeed, the source code shows that DataCollatorForLanguageModeling with mlm=False essentially just copies the input ids to create the labels.

I couldn't find any Transformers documentation describing this shifting behavior.
So where and how is it done?

joaogante · March 1, 2023, 4:26pm

Hey @mk6

As per our model design philosophy, you can find the code inside each model.

For instance, here's the code that does it in GPT2
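The linked code is not reproduced in this thread. As a rough sketch of the idea (not the library's actual implementation), the shift amounts to dropping the last position's logits and the first label, so that position t is scored against token t+1. The version below is plain Python so it runs without PyTorch; the function name `shifted_causal_lm_loss` and the toy logits are assumptions made up for illustration. In tensor terms, the slices correspond to `logits[..., :-1, :]` and `labels[..., 1:]`.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch's CrossEntropyLoss default)


def shifted_causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy for a single sequence (illustrative only).

    logits: per-position score lists, shape [seq_len][vocab_size]
    labels: token ids of length seq_len, an unshifted copy of input_ids
    """
    # Position t predicts token t+1: drop the last logit row and the first label.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue  # masked positions contribute nothing to the loss
        # Negative log-softmax of the target class, computed stably.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    # Returning 0.0 when everything is masked is a choice made for this sketch.
    return total / count if count else 0.0


# Toy check with a vocabulary of 3 tokens and the sequence [0, 2, 1].
input_ids = [0, 2, 1]
labels = list(input_ids)  # what the collator does: copy, no shift
logits = [
    [0.0, 0.0, 10.0],  # position 0 strongly predicts token 2, the next token
    [0.0, 10.0, 0.0],  # position 1 strongly predicts token 1, the next token
    [10.0, 0.0, 0.0],  # position 2 has no next token; its row is dropped
]
loss = shifted_causal_lm_loss(logits, labels)  # close to zero
```

Because the slicing happens inside the loss computation, the caller passes labels that line up one-to-one with input_ids.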

mk6 · March 3, 2023, 7:56am

@joaogante thanks!
I think it would be better to point to the implementation of this behavior from a more general place in the docs. As a user, I'm not familiar with your internal design philosophy or with the implementation itself.
I thought this shifting logic was implemented at a more abstract level, but it turns out to live in each specific model.

WANGYIWEI · April 6, 2023, 2:44pm

Ditto. The label shifting confused me a lot until I rechecked the docs for DataCollatorForLanguageModeling and read its source code (torch_call()) on GitHub. It does indeed just copy the input_ids, yet the model can still update its weights on datasets processed this way.

In contrast, in the custom Dataset class I wrote, I shifted the labels manually in __getitem__(), and after fine-tuning a German GPT2 on 50,000 samples the loss would not go down at all. Then I realized I probably needed to do what the other tutorials do, which is to copy the input_ids as the labels without shifting,
as in Training a causal language model from scratch - Hugging Face NLP Course,
where they clearly highlight this text:

Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.

Therefore, I'd also like to ask the HF team to state clearly in the AutoModelForCausalLM documentation that auto-regressive models loaded from the transformers library shift the labels automatically in the forward() function. All we need to feed the data collator or a custom dataloader is ['input_ids', 'attention_mask', 'labels'], where labels is just a copy of input_ids. We shouldn't shift it manually in the data preparation phase; otherwise there will be a two-token gap between the labels and the logits in the forward function.
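To make the two-token gap concrete, here is a small sketch that only tracks which label ends up paired with which input position once the model applies its own shift. The helper `aligned_pairs` is hypothetical and uses letters instead of token ids; `None` stands in for an ignored label.

```python
def aligned_pairs(input_ids, labels):
    """Pair each input position with the label the loss compares it against,
    assuming the model drops the last position and the first label internally."""
    return list(zip(input_ids[:-1], labels[1:]))


tokens = ["A", "B", "C", "D", "E"]

# Labels are an unshifted copy: each token is scored against its successor.
correct = aligned_pairs(tokens, list(tokens))
# [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E')]

# Labels already shifted by hand in the dataset: labels[t] = tokens[t + 1].
pre_shifted = tokens[1:] + [None]
wrong = aligned_pairs(tokens, pre_shifted)
# [('A', 'C'), ('B', 'D'), ('C', 'E'), ('D', None)]
```

With the manual shift, every position is trained to predict the token two steps ahead, which would be consistent with a loss that refuses to go down.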

PratikJadon · February 21, 2025, 6:47am

What about setting the padding tokens in the labels to -100? The data collators do that, but what if I use a custom one?
Or what if I pass pre-tokenized inputs with labels, without setting -100 for the padding tokens in the labels, and don't pass a data collator to the Trainer?

Please help me with this query.
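A minimal sketch of what a custom collator could do here, assuming right padding and plain Python lists. `collate_causal_lm` is a hypothetical helper, not a Transformers API, and the -100 value relies on the default ignore_index of PyTorch's CrossEntropyLoss. The labels stay an unshifted copy of the real tokens, and only the padded positions are masked.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss by default


def collate_causal_lm(sequences, pad_token_id):
    """Right-pad token-id lists and build labels as unshifted copies of the inputs.

    Hypothetical helper for illustration; not part of the Transformers library.
    """
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Copy the real tokens (no shifting) and mask only the padded positions.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
# batch["labels"] == [[5, 6, 7], [8, 9, -100]]
```

The lists would still need to be converted to tensors before being passed to a model. Masking by position rather than by token id is a deliberate choice in this sketch: when the pad token is set to the EOS token, as is common with GPT-2, masking every occurrence of that id would also hide genuine EOS tokens. If padded positions keep ordinary token ids in the labels, nothing else appears to mask them, so they would count toward the loss.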
