Does modeling_bert use next-token prediction?

Hi,

Yes, for any LLM which you can train in the Transformers library, the model will internally shift the labels one position so that it learns to predict the next token. The convenience of this is that users can simply copy the labels from the inputs, i.e. labels = input_ids.clone(), and then typically replace the tokens which the model shouldn’t learn to predict (like padding tokens) with -100.
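A minimal sketch of what that looks like in practice, using gpt2 purely as a stand-in checkpoint for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# "gpt2" is just an example checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")

# Copy the inputs to create the labels ...
labels = inputs["input_ids"].clone()

# ... and mask out tokens the model shouldn't learn to predict
# (e.g. padding tokens) with the ignore index -100.
if tokenizer.pad_token_id is not None:
    labels[labels == tokenizer.pad_token_id] = -100

# The model shifts the labels internally before computing the loss,
# so the labels can simply mirror the input_ids.
outputs = model(**inputs, labels=labels)
print(outputs.loss)
```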

Visually (taken from my explanation here):

As can be seen, the labels (top row) are equal to the inputs (bottom row), just shifted one position to the left, and with tokens which the model shouldn’t learn to predict (like the special <|begin_of_text|> token in the figure above) replaced by -100.
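To make the shift itself concrete, here is a toy sketch (with made-up shapes and label values) of the kind of shift-and-ignore logic that causal LM heads apply internally before the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

# Toy shapes and labels, just for illustration.
batch, seq_len, vocab_size = 1, 5, 10
logits = torch.randn(batch, seq_len, vocab_size)
labels = torch.tensor([[-100, 4, 7, 2, 9]])  # first token masked with -100

# Drop the last logit (there is no next token to predict after the final one)
# and drop the first label (there is no context to predict the first token from),
# which lines each position's logits up with the *next* token.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

# Positions labeled -100 are ignored by the loss.
loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
    ignore_index=-100,
)
print(loss)
```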