GPT-2 shift logits and labels

I am working with GPT-2 and I was looking at the LM head and how it performs the forward pass when labels are provided: https://huggingface.co/transformers/_modules/transformers/modeling_gpt2.html#GPT2LMHeadModel

It looks like the logits are shifted right (last value is ignored) and the labels are shifted left (first value is ignored).

Why are the logits and labels shifted in different directions?


The logits are not shifted; just the last value is ignored. The labels are shifted inside the model, as described in the docs ("Note that the labels are shifted inside the model, i.e. you can set labels = input_ids"), because we want to avoid any processing on them and just set them equal to the inputs. That way, batch creation can be as easy as possible. The downside is that you don't compute a loss for the prediction at the last position, but we found it acceptable since the sequence length is 512.
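For reference, this is (lightly abridged) the shifting logic from `GPT2LMHeadModel`'s forward pass in `modeling_gpt2.py`, where `lm_logits` and `labels` come from the surrounding method:

    from torch.nn import CrossEntropyLoss

    # Shift so that tokens < n predict n: drop the last logit (it has no
    # target) and the first label (nothing predicts it).
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens and compute the usual cross-entropy
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                    shift_labels.view(-1))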


Thank you for your answer @sgugger. I guess I wanted to know why we are ignoring the last logit value? I think I understand why we ignore the first label.

We have no label for that last logit; that's why it's ignored in the loss computation.
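A toy illustration with a made-up four-token sequence (the names are illustrative): the logit at position i is scored against token i + 1, so the final logit has nothing to be scored against:

    # Each position's logit predicts the *next* token in the sequence.
    tokens = ["t0", "t1", "t2", "t3"]
    targets = tokens[1:]  # ["t1", "t2", "t3"]; nothing follows t3

    print(len(tokens), "logit positions,", len(targets), "labels")
    # 4 logit positions, 3 labels -> the last logit is dropped from the loss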


This is an old post but quite relevant to the problem I am facing. In my loss function, if I pass the labels as `['input_ids']`, then `tf.keras.losses.SparseCategoricalCrossentropy` fails with this error: `Received a label value of 50257 which is outside the valid range of [0, 50257)`. I can see that in my data I have 50257 in the middle. Label values:

    5633 1312 2911 1312 460 423 257 2863 284 6938 340 764 836 256 6044 284 1560 502 764 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 5633 1312 892 326 264 5340 5145 264 2081 475 788 345 423 284 2822 257 1700 2137 764 50257 50257 50257 50257 50257 50257 764 1616 804 588 2838 294 264 257 7932 599 7115 3918 764 1312 655 1392 …

I am a little confused here. Should I be passing `outputs[0]` from the model instead of the `input_ids` from the batch? I am trying to add a layer on GPT-2 for text generation in TensorFlow. Should I be passing the labels as `outputs[0]`? Please advise. @sgugger @gmihaila
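For anyone hitting the same error: GPT-2's vocabulary has 50257 tokens, so valid label IDs run from 0 to 50256, and the padding value 50257 is itself out of range for `SparseCategoricalCrossentropy`. A minimal sketch of one possible workaround, masking pad positions out of the loss (the `PAD_ID` constant and the `masked_sparse_ce` wrapper are illustrative, not from the original code):

    import tensorflow as tf

    PAD_ID = 50257  # the out-of-range pad value seen in the labels above

    def masked_sparse_ce(labels, logits):
        # Zero out pad positions instead of letting them reach the loss.
        mask = tf.cast(tf.not_equal(labels, PAD_ID), logits.dtype)
        # Replace pads with a valid dummy id; the mask zeroes their loss anyway.
        safe_labels = tf.where(tf.equal(labels, PAD_ID),
                               tf.zeros_like(labels), labels)
        per_token = tf.keras.losses.sparse_categorical_crossentropy(
            safe_labels, logits, from_logits=True)
        return tf.reduce_sum(per_token * mask) / tf.maximum(
            tf.reduce_sum(mask), 1.0)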

Hi @sgugger, I hope you are well. Sorry, in this code can I compare the logits and labels to compute a new loss? I mean, is the logits tensor the output that can be compared with the labels?


    for step, batch in enumerate(train_dataloader):
        # Labels are just the input ids; GPT-2 shifts them inside the model.
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        optimizer.zero_grad()

        outputs = model(b_input_ids,
                        labels=b_labels,
                        attention_mask=b_masks,
                        token_type_ids=None)
        # With labels supplied, the model returns the loss first, then the logits.
        loss, logits = outputs[:2]
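
To the question above: yes. When `labels` are passed, `outputs[0]` is the loss and `outputs[1]` is the logits, and a custom loss can compare `logits` with `b_labels` as long as it applies the same shift the model uses internally. A minimal sketch continuing from the loop variables above:

    import torch.nn.functional as F

    # Align predictions with targets exactly as the model does internally:
    # the logit at position i predicts token i + 1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = b_labels[..., 1:].contiguous()

    custom_loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                                  shift_labels.view(-1))
    # custom_loss should match the `loss` returned by the model; swap in any
    # other criterion here to train against a different objective.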