GPT-2 shift logits and labels

I am working with GPT-2 and I was looking at the LM head and how it performs the forward pass when labels are provided:

It looks like the logits are shifted right (last value is ignored) and the labels are shifted left (first value is ignored).

Why are the logits and labels shifted in different directions?


The logits are not shifted; only the last position is ignored. The labels are shifted inside the model, as described in the docs ("Note that the labels are shifted inside the model, i.e. you can set labels = input_ids"), because we want to avoid any preprocessing on them and can simply set them equal to the inputs. That way batch creation stays as easy as possible. The downside is that you don't compute the loss on the last token of the sequence, but we found that acceptable since the sequence length is 512.
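To make this concrete, here is a minimal sketch of the shift as it happens inside `GPT2LMHeadModel`'s forward pass (the random tensors here are just stand-ins for the model's actual logits and `input_ids`):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50257, 6

# Stand-ins: in the real model, logits come from the LM head
# and labels are simply set equal to input_ids.
logits = torch.randn(1, seq_len, vocab_size)          # (batch, seq, vocab)
labels = torch.randint(0, vocab_size, (1, seq_len))   # (batch, seq)

# The shift done inside the model:
shift_logits = logits[..., :-1, :].contiguous()  # drop last position's logits
shift_labels = labels[..., 1:].contiguous()      # drop first token as a label

# Position i's logits are now paired with token i+1 as the target.
loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
)
```

After the shift, both tensors cover `seq_len - 1` positions: the logits at position i predict the token at position i+1, so the last logits (which would predict a token beyond the sequence) and the first token (which nothing predicts) are both dropped.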


Thank you for your answer @sgugger. I guess what I wanted to know is why we ignore the last logit value? I think I understand why we ignore the first label.

We have no label for that last logit, which is why it's ignored in the loss computation.
