A question about the DataCollator for LM

I have a question about DataCollatorForLanguageModeling when training an LM.
I saw this video, which explains very well how the training process works.
Starting at minute 5:12, it says: “the data collator shifts the input, such that the label is the next token in the sequence for every single token in the input.”
That makes sense to me, and it is a nice explanation of what is happening behind the scenes.
But then, looking into the documentation to understand the mlm parameter, I found the following:

mlm (bool, optional, defaults to True) — Whether or not to use masked language modeling. If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.

So now I’m totally confused. Is the data collator shifting the tokens to the left? Or is it only controlling the behavior of the padding tokens?

Thanks

Hi,

No, the data collator itself is not shifting any tokens. With mlm=False, the labels are just a copy of the input_ids, with padding tokens replaced by -100 (which is the ignore index of the cross-entropy loss in PyTorch).
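For illustration, here's a minimal sketch of what the collator produces with mlm=False (gpt2 is just an example checkpoint; since GPT-2 has no pad token, the EOS token is reused for padding):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

examples = [tokenizer("Hello world"), tokenizer("A slightly longer example sentence")]
batch = collator(examples)

print(batch["input_ids"])
print(batch["labels"])
# The labels match input_ids token for token; padded positions are set to -100,
# so they are ignored by the cross-entropy loss. No shifting happens here.
```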

The shifting of the tokens by one position happens inside the model (hence the user doesn’t need to take care of that). This can be seen here for Llama, for instance.
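Roughly, the loss computation inside such a model follows the pattern sketched below (a simplified, illustrative version of what causal LMs like Llama do; the function name and shapes here are just for the example):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy for next-token prediction.

    logits: (batch, seq_len, vocab_size) — model outputs
    labels: (batch, seq_len) — a copy of input_ids, with -100 at padded positions
    """
    # Drop the last logit (there is no "next token" left to predict for it)
    # and drop the first label (no logit predicts the first token).
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Positions labeled -100 (padding) are ignored by cross_entropy by default.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```

So the collator only prepares labels as a copy of input_ids (with -100 for padding); the one-position shift is performed inside the model's loss computation.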

