A question about the DataCollator for LM

I have a question about DataCollatorForLanguageModeling when training an LM.
I saw this video, which explains very well how the training process works.
Starting at minute 5:12, it says: “the data collator shifts the input, such that the label is the next token in the sequence for every single token in the input.”
That makes sense to me, and it is a nice explanation of what is happening behind the scenes.
But then, looking into the documentation to understand the mlm parameter, I found the following:

mlm (bool, optional, defaults to True) — Whether or not to use masked language modeling. If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.

So now I’m totally confused. Is the data collator shifting the tokens to the left? Or is it only controlling the behavior of the padding tokens?

Thanks

Hi,

No, the data collator itself is not shifting any tokens. With mlm=False, the labels are just a copy of the input_ids, with padding tokens replaced by -100 (which is the ignore index of the cross-entropy loss in PyTorch).
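For illustration, here's a minimal sketch of what the collator produces with mlm=False (gpt2 is just an example checkpoint; since GPT-2 has no pad token, the EOS token is reused for padding):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

examples = [tokenizer("Hello world"), tokenizer("A slightly longer example sentence")]
batch = collator(examples)

print(batch["input_ids"])
print(batch["labels"])
# The labels match input_ids token for token; padded positions are set to -100,
# so they are ignored by the cross-entropy loss. No shifting happens here.
```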

The shifting of the tokens by one position happens inside the model (hence the user doesn’t need to take care of that). This can be seen here for Llama, for instance.
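Roughly, the loss computation inside such a model follows the pattern sketched below (a simplified, illustrative version of what causal LMs like Llama do; the function name and shapes here are just for the example):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy for next-token prediction.

    logits: (batch, seq_len, vocab_size) — model outputs
    labels: (batch, seq_len) — a copy of input_ids, with -100 at padded positions
    """
    # Drop the last logit (there is no "next token" left to predict for it)
    # and drop the first label (no logit predicts the first token).
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Positions labeled -100 (padding) are ignored by cross_entropy by default.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```

So the collator only prepares labels as a copy of input_ids (with -100 for padding); the one-position shift is performed inside the model's loss computation.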

