Error in DataCollator section of Hugging Face Tutorial LM fine tuning


I believe I have found a minor error in the documentation on the following page: Causal language modeling.

In the tutorial section about the collator, it is written as follows:

"Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)"

However, as far as I can tell, DataCollatorForLanguageModeling does not shift the labels relative to input_ids. With mlm=False, labels is simply a copy of input_ids.

Could there be something I am misunderstanding?

Thank you for reading my message and for considering this potential correction.

I believe this is handled by the forward method of the XXXFormerLMHeadModel class. For example, the GPT2LMHeadModel.forward method in transformers.models.gpt2.modeling_gpt2 contains the following snippet:

        loss = None
        if labels is not None:
            # move labels to correct device to enable model parallelism
            labels = labels.to(lm_logits.device)
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
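
To make the split of responsibilities concrete, here is a minimal pure-Python sketch of both steps. It is not the library code: collate_causal and shifted_loss are hypothetical helper names, the token ids are made up, and the masking of padding positions with -100 reflects my understanding of the collator's mlm=False behaviour.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss (CrossEntropyLoss default)


def collate_causal(input_ids, pad_token_id):
    """Hypothetical stand-in for the collator with mlm=False.

    Labels are a copy of input_ids (no shift); padding positions are masked.
    """
    labels = [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
    return {"input_ids": list(input_ids), "labels": labels}


def shifted_loss(logits, labels):
    """Hypothetical stand-in for the shift done inside the model's forward.

    The logits at position i are scored against labels[i + 1].
    """
    shift_logits = logits[:-1]  # last position has no next token to predict
    shift_labels = labels[1:]   # first token is never a prediction target
    losses = []
    for row, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue  # masked positions do not contribute to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])  # negative log-likelihood
    return sum(losses) / len(losses)


# Toy sequence: three real tokens followed by two padding tokens (id 0).
vocab_size = 8
input_ids = [5, 2, 7, 0, 0]
batch = collate_causal(input_ids, pad_token_id=0)
# batch["labels"] == [5, 2, 7, -100, -100], aligned with input_ids

# Toy logits that put most of the mass on the true next token at each position.
logits = [[0.0] * vocab_size for _ in input_ids]
for i in range(len(input_ids) - 1):
    logits[i][input_ids[i + 1]] = 10.0

loss = shifted_loss(logits, batch["labels"])
print(batch["labels"], loss)
```

The labels come out aligned with input_ids, and the loss is small only because the shift inside shifted_loss pairs position i with token i + 1. Shifting the labels in the collator as well would offset the targets twice.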

You are correct that the label shifting is handled within the model.
Therefore, the sentence I quoted from the tutorial is indeed an error.

Thank you for providing an example with the code.