How is the data shifted by one token during CausalLM fine-tuning?

There is an explanation in the documentation of how labels are shifted inside the model: Causal language modeling

Also, there is a PR in transformers github repo on this: Shifting labels for causal LM when using label smoother by seungeunrho · Pull Request #17987 · huggingface/transformers · GitHub

So, shifting is handled inside the model. `input_ids` and `labels` can be the very same tensor; the model applies the causal shift internally.
For example, assume `input_ids` is [1,2,3,4,5,6,7,8] and `labels` is the same tensor [1,2,3,4,5,6,7,8]. Inside the model, the logits are shifted left and the labels right, so the predictions at positions [1,2,3,4,5,6,7] are trained against the targets [2,3,4,5,6,7,8]: each position predicts the next token, the first token has no prediction target, and the last position's logit is dropped.
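A minimal sketch of that shift in plain Python (an assumption mirroring what the transformers modeling code does with `shift_logits = logits[..., :-1, :]` and `shift_labels = labels[..., 1:]`; the real code operates on tensors of logits, not raw token ids):

```python
input_ids = [1, 2, 3, 4, 5, 6, 7, 8]
labels = list(input_ids)  # labels can be the very same tokens

# Inside the model: drop the last position's prediction and the
# first label, so position i is trained to predict token i + 1.
shift_inputs = input_ids[:-1]  # positions that produce predictions
shift_labels = labels[1:]      # the next-token targets

# Each pair is (token seen at position i, token to predict).
pairs = list(zip(shift_inputs, shift_labels))
print(pairs)  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
```

So you never need to shift the data yourself when passing `labels` to the model; doing so would shift it twice.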