Yes, for any LLM that you can train with the Transformers library, the model will internally shift the labels one position so that it learns to predict the next token. The convenience of this is that users can just copy the labels from the inputs, i.e. labels = input_ids.clone(), although users then typically also replace tokens which the model shouldn't learn to predict (like padding tokens) with -100.
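For example, here's a minimal sketch of preparing labels for causal LM training (the "gpt2" checkpoint is just an example; any causal LM tokenizer works the same way):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["Hello world", "A slightly longer example sentence"],
    padding=True,
    return_tensors="pt",
)

# copy the inputs as labels - the model shifts them internally
labels = batch["input_ids"].clone()
# don't compute a loss on padding tokens
labels[batch["attention_mask"] == 0] = -100
batch["labels"] = labels
```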
As can be seen, the labels (top row) are equal to the inputs (bottom row), just shifted one position to the left, and with tokens which the model shouldn't learn to predict (like the special <|begin_of_text|> token in the figure above) replaced by -100.
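In code, the shift that happens inside the model's forward pass looks roughly like this (a simplified sketch of the loss computation in the modeling code, not the exact implementation):

```python
import torch
from torch.nn import CrossEntropyLoss

def causal_lm_loss(logits, labels):
    # the logit at position i is trained to predict the label at position i + 1,
    # so both tensors are shifted by one before the loss is computed
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
    return loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```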
Yes, I understand next-token prediction and label shifting. But BERT here is not a CLM model, so I am confused why it has a label shift. Given it's an MLM, I assume it should just do cross-entropy over the masked tokens, with no need for a shift?
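I.e. I would expect the MLM loss to look roughly like this (my own sketch of the objective I have in mind, not the actual implementation):

```python
import torch
from torch.nn import CrossEntropyLoss

def mlm_loss(prediction_scores, labels):
    # no shift: position i predicts the original token at position i,
    # and non-masked positions are set to -100 so they are ignored
    loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
    return loss_fct(
        prediction_scores.view(-1, prediction_scores.size(-1)),
        labels.view(-1),
    )
```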
That’s because some people were interested in initializing decoder-only LLMs with the weights of BERT. This was mainly for the EncoderDecoderModel class, where the weights of both the encoder and the decoder were initialized from a pre-trained BERT. See Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models.
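A minimal sketch of that warm-starting setup, roughly following the blog post ("bert-base-uncased" is just an example checkpoint):

```python
from transformers import EncoderDecoderModel, BertTokenizer

# warm-start both the encoder and the decoder from a pre-trained BERT checkpoint
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# the decoder side needs a start token and pad token to be configured
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```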