Why does modeling_bert support next-token prediction?

That’s because some users were interested in initializing decoders with the weights of BERT. It was added mainly for the EncoderDecoderModel class, where the weights of both the encoder and the decoder can be initialized from a pre-trained BERT checkpoint. See the blog post Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models.
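For illustration, here is a minimal sketch (assuming a recent version of transformers, with "bert-base-uncased" as a placeholder checkpoint) of both paths: loading BERT directly as a causal, next-token decoder via BertLMHeadModel, and warm-starting an EncoderDecoderModel where both halves come from a pre-trained BERT checkpoint:

```python
from transformers import BertConfig, BertLMHeadModel, EncoderDecoderModel

# BERT as a standalone causal decoder: is_decoder=True switches the
# self-attention to a causal mask, so the LM head predicts the next token.
config = BertConfig.from_pretrained("bert-base-uncased", is_decoder=True)
decoder = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)

# Warm-starting a seq2seq model: both the encoder and the decoder are
# initialized from BERT weights; cross-attention layers are added to the
# decoder and trained from scratch.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
```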
