GPT-2 shift logits and labels

I am working with GPT-2 and I was looking at the LM head and how it performs the forward pass when labels are provided: https://huggingface.co/transformers/_modules/transformers/modeling_gpt2.html#GPT2LMHeadModel

It looks like the logits are shifted right (last value is ignored) and the labels are shifted left (first value is ignored).

Why are the logits and labels shifted in different directions?


The logits are not shifted, just the last value is ignored. The labels are shifted inside the model as described in the docs ("Note that the labels are shifted inside the model, i.e. you can set labels = input_ids") because we want to avoid any processing on them and just set them equal to the inputs. That way the batch creation can be as easy as possible. The downside is that you don’t compute the loss on the last character of the sentence, but we found it’s acceptable since the sequence length is 512.
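A minimal sketch of that alignment, in plain Python rather than the library's tensor code. The token ids and the helper name `shift_for_lm_loss` are made up for illustration; in the Transformers source the same slicing is written as `lm_logits[..., :-1, :]` and `labels[..., 1:]`.

```python
def shift_for_lm_loss(logits, labels):
    """Pair the prediction made at position i with the token at position i + 1."""
    shifted_logits = logits[:-1]  # the last prediction has no next token to score against
    shifted_labels = labels[1:]   # the first token is never predicted from anything
    return shifted_logits, shifted_labels


# Hypothetical token ids; labels are simply a copy of the inputs.
input_ids = [10, 11, 12, 13]
labels = list(input_ids)

# One placeholder "prediction" per input position, standing in for a logits vector.
logits = [f"prediction_after_token_{t}" for t in input_ids]

shifted_logits, shifted_labels = shift_for_lm_loss(logits, labels)
for pred, target in zip(shifted_logits, shifted_labels):
    print(pred, "->", target)
```

Both slices have length 3 here: the prediction made after the last token is dropped because nothing follows it, and the first token is dropped as a target because nothing precedes it.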


Thank you for your answer @sgugger. I guess what I wanted to know is why we ignore the last logit value. I think I understand why we ignore the first label.

We have no label for the last logit, which is why it is ignored in the loss computation.


This is an old post, but it is quite relevant to the problem I am facing. In my loss function, if I pass the labels as ['input_ids'], then tf.keras.losses.SparseCategoricalCrossentropy fails with this error: "Received a label value of 50257 which is outside the valid range of [0, 50257)". I can see that my data has 50257 in the middle.

Label values: 5633 1312 2911 1312 460 423 257 2863 284 6938 340 764 836 256 6044 284 1560 502 764 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 5633 1312 892 326 264 5340 5145 264 2081 475 788 345 423 284 2822 257 1700 2137 764 50257 50257 50257 50257 50257 50257 764 1616 804 588 2838 294 264 257 7932 599 7115 3918 764 1312 655 1392 …

I am a little confused here. Should I be passing outputs[0] from the model as the labels instead of the input_ids from the batch? I am trying to add a layer on top of GPT-2 for text generation in TensorFlow. Please advise. @sgugger @gmihaila
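The error message hints at a possible cause: a valid range of [0, 50257) means the loss expects ids 0 to 50256, so 50257 may be a padding id that was added on top of the original vocabulary. One common way to handle padding, sketched below in plain Python, is to leave padded positions out of the loss. The pad id, the helper name `masked_next_token_loss`, and the toy probabilities are assumptions for illustration, not taken from the code in this post.

```python
import math

PAD_ID = 50257  # assumption: the id used for padding in the data above


def masked_next_token_loss(probs, labels, pad_id=PAD_ID):
    """Mean negative log-likelihood of the next token, skipping padded targets.

    probs[i] is a dict mapping token id -> probability predicted at position i.
    """
    losses = []
    # Same alignment as in the model: prediction at position i vs. token at i + 1.
    for dist, target in zip(probs[:-1], labels[1:]):
        if target == pad_id:
            continue  # padded positions contribute nothing to the loss
        losses.append(-math.log(dist[target]))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: three real tokens followed by two padding tokens.
labels = [5633, 1312, 764, PAD_ID, PAD_ID]
probs = [
    {1312: 0.5},   # position 0 predicts token 1312 with probability 0.5
    {764: 0.25},   # position 1 predicts token 764 with probability 0.25
    {},            # remaining positions only face padded targets
    {},
    {},
]
loss = masked_next_token_loss(probs, labels)
print(loss)
```

Only the two real targets are scored here, so the padding id never reaches the loss function.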

@sgugger Hi, I hope you are well. Sorry, in this code, can I compare the logits and labels to compute a new loss? I mean, are the logits the output that can be compared with the labels?


    for step, batch in enumerate(train_dataloader):
        # batch[0] holds the token ids, batch[1] the attention mask
        b_input_ids = batch[0].to(device)
        # labels are the same ids; the model shifts them internally
        b_labels = batch[0].to(device)
        print("b_labels", b_labels)
        print(b_labels.shape)
        b_masks = batch[1].to(device)

        optimizer.zero_grad()

        outputs = model(
            b_input_ids,
            labels=b_labels,
            attention_mask=b_masks,
            token_type_ids=None,
        )
        # first element is the loss, second is the (unshifted) logits
        loss, logits = outputs[:2]