How is the data shifted by one token during CausalLM fine-tuning?

Hi,
I am trying to fine-tune the distilgpt2 model on the MNLI dataset to turn it into a classifier that generates the right label (contradiction, entailment, or neutral). I followed the example given in the transformers documentation.

I am stepping through the Trainer in a debugger because I would like to see how the “input_ids”/“labels” are shifted by one token by the DataCollator, as described in the documentation linked above, which states the following:

Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
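For reference, this is what calling that collator directly looks like (a minimal sketch with made-up sentences); the returned labels come back as a straight copy of the input_ids, with padded positions set to -100:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("distilgpt2")
tok.pad_token = tok.eos_token
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)

features = [tok("a short sentence"), tok("a slightly longer example sentence")]
batch = collator(features)
print(batch["input_ids"])
print(batch["labels"])  # same token IDs as input_ids; padded positions become -100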

My code looks as follows:

from transformers import AutoTokenizer

model_checkpoint = "distilgpt2"
batch_size = 24
sequence_length = block_size = 512

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token

def form_training_prompts(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]
    class_label = ["entailment", "neutral", "contradiction"][example["label"]]

    example["text"] = (
        f"mnli hypothesis: {hypothesis} premise: {premise} "
        f"target: {class_label}<|endoftext|>"
    )
    return example

def tokenizes_text(dataset):
    tokenized = tokenizer(dataset["text"], return_tensors="np")
    return tokenized

# %%
def group_texts(examples, block_size):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead of this drop if the
    # model supported it. You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Create prompts
dataset = dataset.map(
    form_training_prompts,
    remove_columns=["hypothesis", "premise", "label", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)

# Tokenize prompts
dataset = dataset.map(
    tokenizes_text,
    batched=True,
    batch_size=batch_size,
    num_proc=1,
    remove_columns=dataset.column_names,
    load_from_cache_file=False,
    desc="Tokenizing text",
)

dataset = dataset.map(
    group_texts,
    batched=True,
    batch_size=batch_size,
    num_proc=1,
    load_from_cache_file=False,
    desc="Packing sequences",
    fn_kwargs={"block_size": block_size}
)
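At this point, a quick check on one packed example (a sketch, assuming the three map calls above have run) already shows that the labels are just a copy of the input_ids, since group_texts copies them verbatim:

sample = dataset[0]
print(sample["input_ids"][:8])
print(sample["labels"][:8])                     # identical to input_ids
print(sample["input_ids"] == sample["labels"])  # True: group_texts copies input_ids into labels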

The validation set is created with the same process. The code continues below:

from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

trainer.train()

As I step through the code in the debugger, a print of the “inputs” variable in the training loop (at line 1872 of the Trainer source code) looks like this:

{'input_ids': tensor([
        [**13**,   383,  3721,  ..., 13591,    11,  **2592**],
        [**13262**,   515,  4560,  ...,  9022,  6890,  **1597**],
        [ **3950**,  3037,  1998,  ..., 22908,    11, **22908**],
        ...,
        [ **4038**,    13,  6601,  ...,   511, 31156,  **4788**],
        [  **355**,   857,  3220,  ..., 12678,   286,  **3113**],
        [  **338**,  1410,    13,  ...,   642,    38,  **2831**]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': tensor([
        [**13**,   383,  3721,  ..., 13591,    11,  **2592**],
        [**13262**,   515,  4560,  ...,  9022,  6890,  **1597**],
        [ **3950**,  3037,  1998,  ..., 22908,    11, **22908**],
        ...,
        [ **4038**,    13,  6601,  ...,   511, 31156,  **4788**],
        [  **355**,   857,  3220,  ..., 12678,   286,  **3113**],
        [  **338**,  1410,    13,  ...,   642,    38,  **2831**]])}

I have highlighted with double asterisks the first and last token IDs in the input_ids tensor and the labels tensor, and they are the same.

  • Going back to my question: where does the token shifting take place? Am I misunderstanding how this works?

  • After packing the data into blocks (the group_texts function), how does the model distinguish between the different examples rather than treating them as one continuous text? Wouldn’t adding an EOS token at the end of each prompt example help?


I had the same doubt. Did you find an answer?

There is an explanation in the documentation of how labels are shifted inside the model: Causal language modeling

Also, there is a PR in transformers github repo on this: Shifting labels for causal LM when using label smoother by seungeunrho · Pull Request #17987 · huggingface/transformers · GitHub

So, the shifting is handled inside the model. The ‘input_ids’ and ‘labels’ can be the very same tensors; the model performs the causal shift internally.
For example, assume ‘input_ids’ is [1,2,3,4,5,6,7,8] and ‘labels’ is the same tensor [1,2,3,4,5,6,7,8]; inside the model, the tokens [1,2,3,4,5,6,7] are used to predict [2,3,4,5,6,7,8], so the first token never appears as a target and the prediction made at the last position is discarded.
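A minimal sketch of that shift (random logits stand in for real model output; this mirrors the shift-then-cross-entropy pattern used in GPT-2-style forward() implementations):

import torch
import torch.nn.functional as F

input_ids = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8]])
labels = input_ids.clone()               # labels start as an exact copy of input_ids

vocab_size = 10
logits = torch.randn(1, input_ids.shape[1], vocab_size)   # stand-in for model output

# The causal shift: predictions at positions 0..6 are scored against tokens 2..8.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = F.cross_entropy(shift_logits.view(-1, vocab_size), shift_labels.view(-1))
print(loss)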


It’s actually not done only during training but right in the forward() pass: labels are systematically shifted inside forward(), which lets you compute perplexity by simply passing a copy of the input_ids as labels; at least that is the case for Bloom.

I haven’t checked whether every other model does the same, but it would be a terrible design mistake if they didn’t.
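For example, a sketch of that perplexity computation (using distilgpt2 here; any causal LM whose forward() applies the shift works the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

enc = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    # The labels are an exact copy of input_ids; the model shifts them internally before the loss.
    out = model(**enc, labels=enc["input_ids"].clone())

print(torch.exp(out.loss).item())  # perplexity of the sentence under the model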