How is the data shifted by one token during CausalLM fine-tuning?

Hi,
I am trying to fine-tune the distilgpt2 model on the MNLI dataset to turn it into a classifier that generates the right label (contradiction, entailment, or neutral). I followed the example given in the transformers documentation.

I am stepping through the Trainer in a debugger because I would like to see how the “input_ids”/“labels” are shifted by one token by the DataCollator, as described in the documentation linked above, which states the following:

Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
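For reference, this is what calling that collator directly looks like (a minimal sketch with made-up sentences); the returned labels come back as a straight copy of the input_ids, with padded positions set to -100:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("distilgpt2")
tok.pad_token = tok.eos_token
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)

features = [tok("a short sentence"), tok("a slightly longer example sentence")]
batch = collator(features)
print(batch["input_ids"])
print(batch["labels"])  # same token IDs as input_ids; padded positions become -100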

My code looks as follows:

from transformers import AutoTokenizer

model_checkpoint = "distilgpt2"
batch_size = 24
sequence_length = block_size = 512

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token

def form_training_prompts(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]
    class_label = ["entailment", "neutral", "contradiction"][example["label"]]

    example["text"] = (
        f"mnli hypothesis: {hypothesis} premise: {premise} "
        f"target: {class_label}<|endoftext|>"
    )
    return example

def tokenizes_text(dataset):
    tokenized = tokenizer(dataset["text"], return_tensors="np")
    return tokenized

# %%
def group_texts(examples, block_size):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead of this drop if the
    # model supported it. You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Create prompts
dataset = dataset.map(
    form_training_prompts,
    remove_columns=["hypothesis", "premise", "label", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)

# Tokenize prompts
dataset = dataset.map(
    tokenizes_text,
    batched=True,
    batch_size=batch_size,
    num_proc=1,
    remove_columns=dataset.column_names,
    load_from_cache_file=False,
    desc="Tokenizing text",
)

dataset = dataset.map(
    group_texts,
    batched=True,
    batch_size=batch_size,
    num_proc=1,
    load_from_cache_file=False,
    desc="Packing sequences",
    fn_kwargs={"block_size": block_size}
)
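At this point, a quick check on one packed example (a sketch, assuming the three map calls above have run) already shows that the labels are just a copy of the input_ids, since group_texts copies them verbatim:

sample = dataset[0]
print(sample["input_ids"][:8])
print(sample["labels"][:8])                     # identical to input_ids
print(sample["input_ids"] == sample["labels"])  # True: group_texts copies input_ids into labels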

The validation set is created with the same process. The code continues below:

from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

trainer.train()

As I step through the code in the debugger, a print of the “inputs” variable in the training loop (at line 1872 of the Trainer source code) looks like this:

{'input_ids': tensor([
        [**13**,   383,  3721,  ..., 13591,    11,  **2592**],
        [**13262**,   515,  4560,  ...,  9022,  6890,  **1597**],
        [ **3950**,  3037,  1998,  ..., 22908,    11, **22908**],
        ...,
        [ **4038**,    13,  6601,  ...,   511, 31156,  **4788**],
        [  **355**,   857,  3220,  ..., 12678,   286,  **3113**],
        [  **338**,  1410,    13,  ...,   642,    38,  **2831**]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': tensor([
        [**13**,   383,  3721,  ..., 13591,    11,  **2592**],
        [**13262**,   515,  4560,  ...,  9022,  6890,  **1597**],
        [ **3950**,  3037,  1998,  ..., 22908,    11, **22908**],
        ...,
        [ **4038**,    13,  6601,  ...,   511, 31156,  **4788**],
        [  **355**,   857,  3220,  ..., 12678,   286,  **3113**],
        [  **338**,  1410,    13,  ...,   642,    38,  **2831**]])}

I have highlighted with double asterisks the first and last token IDs in the input_ids tensor and the labels tensor, and they are the same.

  • Going back to my question: where does the token shifting take place? Am I misunderstanding how this works?

  • After packing the data into blocks (the group_texts function), how does the model distinguish between the different examples rather than treating them as one continuous text? Wouldn’t adding an EOS token at the end of each prompt example help?


I had the same doubt. Did you find an answer?

There is an explanation in the documentation of how labels are shifted inside the model: Causal language modeling

Also, there is a PR in transformers github repo on this: Shifting labels for causal LM when using label smoother by seungeunrho · Pull Request #17987 · huggingface/transformers · GitHub

So, the shifting is handled inside the model. The ‘input_ids’ and ‘labels’ can be the very same tensors; the model performs the causal shift internally.
For example, assume ‘input_ids’ is [1,2,3,4,5,6,7,8] and ‘labels’ is the same tensor [1,2,3,4,5,6,7,8]; inside the model, the tokens [1,2,3,4,5,6,7] are used to predict [2,3,4,5,6,7,8], so the first token never appears as a target and the prediction made at the last position is discarded.
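A minimal sketch of that shift (random logits stand in for real model output; this mirrors the shift-then-cross-entropy pattern used in GPT-2-style forward() implementations):

import torch
import torch.nn.functional as F

input_ids = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8]])
labels = input_ids.clone()               # labels start as an exact copy of input_ids

vocab_size = 10
logits = torch.randn(1, input_ids.shape[1], vocab_size)   # stand-in for model output

# The causal shift: predictions at positions 0..6 are scored against tokens 2..8.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = F.cross_entropy(shift_logits.view(-1, vocab_size), shift_labels.view(-1))
print(loss)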


It’s actually not done only during training but right in the forward() pass: labels are systematically shifted inside forward(), which lets you compute perplexity by simply passing a copy of the input_ids as labels; at least that is the case for Bloom.

I haven’t checked whether every other model does the same, but it would be a terrible design mistake if they didn’t.
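For example, a sketch of that perplexity computation (using distilgpt2 here; any causal LM whose forward() applies the shift works the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

enc = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    # The labels are an exact copy of input_ids; the model shifts them internally before the loss.
    out = model(**enc, labels=enc["input_ids"].clone())

print(torch.exp(out.loss).item())  # perplexity of the sentence under the model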