TFBertForMaskedLM outputting an identical prediction for every input

Hello,
I have been trying to use TFBertForMaskedLM. I realized that it was not working, so I ran a small experiment: I decided to NOT mask the input and let the model simply overfit on a small amount of data. However, the model predicts 0 for everything. 0 is my token id for [PAD], which is the most common token in the data. The model DOES NOT even overfit.

Here is the code for the model:

import tensorflow as tf
from transformers import BertConfig, TFBertForMaskedLM

class Bert(tf.keras.Model):
  def __init__(self):
    super(Bert, self).__init__()
    self.vocab_size = 30000
    self.encoder = TFBertForMaskedLM(BertConfig(vocab_size=30000))
    self.optimizer = tf.keras.optimizers.Adam(0.0001)

  def train_step(self, inputs):
    # per-token cross-entropy over the full sequence (no reduction yet)
    loss_fn = tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    with tf.GradientTape() as tape:
        predictions, _ = self(inputs)
        predictions_mlm = predictions.logits
        # labels are the unmasked input ids, one-hot over the vocabulary
        mlm_labels = tf.one_hot(inputs[0], self.vocab_size, axis=2)  # inputs[0] == encodings
        loss_mlm = loss_fn(mlm_labels, predictions_mlm)
        loss = tf.cast(tf.reduce_mean(loss_mlm), tf.float32)

    trainable_vars = self.encoder.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    self.optimizer.apply_gradients(zip(gradients, trainable_vars))

    return loss

  def call(self, inputs):
    encodings, labels = inputs
    return self.encoder(encodings, training=True), labels
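
For reference, this is roughly how I am driving training. The tensors below are just placeholders with arbitrary shapes; my real input ids come from a tokenizer.

model = Bert()
batch_size, seq_len = 8, 128
# placeholder ids only, to illustrate shapes; in my setup these are real tokenized sentences
encodings = tf.random.uniform((batch_size, seq_len), maxval=30000, dtype=tf.int32)
labels = encodings  # no masking: the labels are the unmodified input ids

for step in range(200):
    loss = model.train_step((encodings, labels))
    print(step, float(loss))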

No matter what the input is, the model always predicts the output tokens to be 0 (which is ‘[PAD]’).
I have also tried a range of different learning rates. The result is the same.
I also tried passing in random input instead of the encodings, keeping the original encodings as labels, and the behavior was identical to before. This shows that the model is completely ignoring the input and is looking at the labels only. Since ‘[PAD]’ is the most common token in the labels, it always outputs ‘[PAD]’. Is there something I am missing?
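
If the [PAD] positions dominating the loss is the problem, would weighting them out of the loss be the right fix? A minimal sketch of what I have in mind (untested; it assumes my pad token id of 0 and vocab size of 30000):

def masked_mlm_loss(label_ids, logits, pad_token_id=0, vocab_size=30000):
    # per-token cross-entropy, no reduction yet
    loss_fn = tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    one_hot_labels = tf.one_hot(label_ids, vocab_size, axis=2)
    per_token_loss = loss_fn(one_hot_labels, logits)  # shape (batch, seq_len)
    # keep only positions that are not [PAD] and average over them
    mask = tf.cast(tf.not_equal(label_ids, pad_token_id), per_token_loss.dtype)
    return tf.reduce_sum(per_token_loss * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

Any help is much appreciated.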