Yes, for any LLM that you can train with the Transformers library, the model will internally shift the labels one position so that it learns to predict the next token. The convenience of this is that users can just copy the labels from the inputs, i.e. labels = input_ids.clone(), although users then typically also replace tokens which the model shouldn't learn to predict (like padding tokens) with -100.
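For example, here's a minimal sketch of preparing labels for causal LM training (the "gpt2" checkpoint is just an example; any causal LM tokenizer works the same way):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["Hello world", "A slightly longer example sentence"],
    padding=True,
    return_tensors="pt",
)

# copy the inputs as labels - the model shifts them internally
labels = batch["input_ids"].clone()
# don't compute a loss on padding tokens
labels[batch["attention_mask"] == 0] = -100
batch["labels"] = labels
```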
As can be seen, the labels (top row) are equal to the inputs (bottom row), just shifted one position to the left, and with tokens which the model shouldn't learn to predict (like the special <|begin_of_text|> token in the figure above) replaced by -100.
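In code, the shift that happens inside the model's forward pass looks roughly like this (a simplified sketch of the loss computation in the modeling code, not the exact implementation):

```python
import torch
from torch.nn import CrossEntropyLoss

def causal_lm_loss(logits, labels):
    # the logit at position i is trained to predict the label at position i + 1,
    # so both tensors are shifted by one before the loss is computed
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
    return loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```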
Yes, I understand next-token prediction and label shifting. But BERT here is not a CLM model, so I am confused why it has a label shift. Given it's an MLM, I assume it should just do cross-entropy over the masked tokens, with no need for a shift?
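I.e. I would expect the MLM loss to look roughly like this (my own sketch of the objective I have in mind, not the actual implementation):

```python
import torch
from torch.nn import CrossEntropyLoss

def mlm_loss(prediction_scores, labels):
    # no shift: position i predicts the original token at position i,
    # and non-masked positions are set to -100 so they are ignored
    loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
    return loss_fct(
        prediction_scores.view(-1, prediction_scores.size(-1)),
        labels.view(-1),
    )
```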
That’s because some people were interested in initializing decoder-only LLMs with the weights of BERT. This was mainly for the EncoderDecoderModel class, where the weights of both the encoder and the decoder were initialized from a pre-trained BERT. See Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models.
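A minimal sketch of that warm-starting setup, roughly following the blog post ("bert-base-uncased" is just an example checkpoint):

```python
from transformers import EncoderDecoderModel, BertTokenizer

# warm-start both the encoder and the decoder from a pre-trained BERT checkpoint
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# the decoder side needs a start token and pad token to be configured
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```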