Error in the DataCollator section of the Hugging Face LM fine-tuning tutorial

Hello,

I believe I have found a minor error in the documentation on the following page: Causal language modeling.

In the tutorial section about the collator, it is written as follows:

"Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)"

However, it seems to me that DataCollatorForLanguageModeling does not shift the input_ids to create the labels. Instead, it keeps both values the same (only replacing padding positions in the labels with -100).
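
A quick check seems to confirm this. Here is a minimal sketch (using the gpt2 tokenizer purely for illustration; any causal LM tokenizer should behave the same way):

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    # "gpt2" is only an example checkpoint for this check
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    features = [tokenizer("hello world"), tokenizer("a slightly longer example sentence")]
    batch = data_collator(features)

    print(batch["input_ids"])
    print(batch["labels"])
    # The labels come out as an unshifted copy of input_ids;
    # only the padding positions are replaced with -100.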

Could there be something I am misunderstanding?

Thank you for reading my message and for considering this potential correction.


I believe this is handled by the forward method of the XXXFormerLMHeadModel classes. For example, the GPT2LMHeadModel.forward method in transformers.models.gpt2.modeling_gpt2 contains the following snippet:

        loss = None
        if labels is not None:
            # move labels to correct device to enable model parallelism
            labels = labels.to(lm_logits.device)
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
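
In other words, passing labels that are identical to input_ids is enough; the one-position shift happens inside forward when the loss is computed. A minimal sketch (again using the gpt2 checkpoint just as an example):

    import torch
    from transformers import AutoTokenizer, GPT2LMHeadModel

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    enc = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
    # Labels are the unshifted input_ids; forward() shifts logits/labels internally.
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    print(out.loss)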

You are correct that the label shifting is handled within the model, so the collator does not shift anything itself.
Therefore, the part pointed out in the tutorial is indeed an error.

Thank you for providing an example with the code.