The official tutorial on building a causal LM from scratch says: “Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.”
Indeed, the source code shows that DataCollatorForLanguageModeling with mlm=False essentially just copies the input ids to create the labels.
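To make the copying concrete, here is a minimal plain-Python sketch of what the collator appears to do with mlm=False. The helper name copy_inputs_as_labels and the list-based batch are assumptions for illustration only, not the library's code; as far as can be told from the source, the real collator works on tensors and also replaces padding positions in the labels with -100 so the loss ignores them.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def copy_inputs_as_labels(batch_input_ids, pad_token_id=None):
    """Hypothetical helper: build labels by copying input_ids, with no shifting."""
    labels = []
    for ids in batch_input_ids:
        row = list(ids)  # plain copy of the inputs
        if pad_token_id is not None:
            # Mask padding so it does not contribute to the loss.
            row = [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        labels.append(row)
    return {"input_ids": [list(ids) for ids in batch_input_ids], "labels": labels}


batch = copy_inputs_as_labels([[5, 6, 7, 0], [8, 9, 0, 0]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7, -100], [8, 9, -100, -100]]
```

Note that nothing here moves tokens left or right; the labels line up position by position with the inputs.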
I couldn’t find any page in the Transformers documentation that describes this shifting behavior.
So where and how is it done?
I think it would be better for the docs to point to the implementation of this behavior from a more general place. As a user, I’m familiar with neither your internal design philosophy nor the implementation itself.
I thought this shifting logic was implemented at a more abstract level, but it turns out to live in the specific model classes.
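For readers looking for what that per-model logic amounts to, here is a rough sketch in plain Python rather than PyTorch. It follows the pattern that, as far as can be told from the source of models such as GPT-2, is applied in forward() when labels are passed: drop the last position's logits, drop the first label, then average the cross-entropy over the remaining pairs. The function name causal_lm_loss and the list-based inputs are assumptions for illustration, not the library's API.

```python
import math

IGNORE_INDEX = -100  # positions with this label are skipped


def causal_lm_loss(logits, labels):
    """logits: one list of vocabulary scores per position; labels: one token id per position."""
    shift_logits = logits[:-1]  # predictions made at positions 0..n-2
    shift_labels = labels[1:]   # targets are the tokens at positions 1..n-1
    losses = []
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses) if losses else float("nan")


labels = [0, 1, 2]  # a plain copy of the input ids
next_token_logits = [[0, 10, 0], [0, 0, 10], [10, 0, 0]]  # position t favors token t+1
same_token_logits = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]  # position t favors token t

print(causal_lm_loss(next_token_logits, labels))  # close to 0
print(causal_lm_loss(same_token_logits, labels))  # roughly 10
```

With unshifted labels, the loss is low only when each position predicts the following token, which is the behavior a causal LM is supposed to learn.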
Ditto. The label shifting confused me a lot until I reread the documentation for DataCollatorForLanguageModeling and checked its source code (torch_call()) on GitHub. It indeed just copies the input_ids, yet the model can still update its weights from datasets processed this way.
In contrast, in the custom Dataset class I wrote, I shifted the labels manually in get_item(), and after fine-tuning a German GPT2 on 50000 samples, the loss would not go down at all. Then I realized I might need to do what other tutorials do, that is, use a plain copy of the input_ids as the labels without shifting.
For example, Training a causal language model from scratch - Hugging Face Course, where the following text is highlighted:
Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.
Therefore, I’d also like to ask the HF team to state clearly in the AutoModelForCausalLM documentation that the auto-regressive models loaded from the transformers library shift the labels automatically in the forward() function. All we need to feed the data collator or a custom dataloader is ['input_ids', 'attention_mask', 'label_ids'], where label_ids is just a copy of input_ids. We shouldn’t shift it manually in the data preparation phase; otherwise the labels get shifted twice, leaving a two-token gap between labels and logits in the forward function.
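The two-token gap can be seen with a toy example. This is only a sketch: aligned_pairs is a hypothetical helper that reproduces the one-position shift the model is assumed to apply internally, and the token ids are made up.

```python
def aligned_pairs(input_ids, labels):
    """Pair each input position with the target that a one-position in-model shift assigns to it."""
    return list(zip(input_ids[:-1], labels[1:]))


tokens = [10, 11, 12, 13, 14]

# Recommended: labels are a plain copy of the inputs.
correct = aligned_pairs(tokens, list(tokens))

# Mistake: labels were already shifted by one in the dataset (last slot ignored).
pre_shifted = tokens[1:] + [-100]
double_shifted = aligned_pairs(tokens, pre_shifted)

print(correct)         # [(10, 11), (11, 12), (12, 13), (13, 14)]
print(double_shifted)  # [(10, 12), (11, 13), (12, 14), (13, -100)]
```

In the second case every position is asked to predict the token two steps ahead, which matches the symptom of a loss that refuses to go down.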