TFBertForMaskedLM outputting an identical prediction for every input

I have been trying to use TFBertForMaskedLM. Since it was not working, I ran a little experiment: I do NOT mask the input at all and just let the model overfit on a small amount of data. However, the model predicts token id 0 for everything. 0 is my token id for [PAD], which is the most common token in the data. The model does not even overfit.
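Concretely, the sanity-check setup just reuses the unmasked, padded encodings as the labels. A schematic of the data I feed in (the token ids and sequence length below are made up for illustration; 0 is [PAD]):

```python
# Toy stand-in for my real batch: already-tokenized, padded sequences.
# 0 is the [PAD] id, so it dominates each row after padding.
encodings = [
    [101, 7592, 2088, 102, 0, 0, 0, 0],   # [CLS] hello world [SEP] + padding
    [101, 2307, 102, 0, 0, 0, 0, 0],
]

# No masking step at all: the MLM labels are just the inputs themselves,
# so the model only has to learn the identity mapping to overfit.
labels = [row[:] for row in encodings]

pad_fraction = sum(row.count(0) for row in encodings) / sum(len(row) for row in encodings)
print(pad_fraction)  # more than half of all positions are [PAD]
```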

Here is the code for the model:

import tensorflow as tf
from transformers import BertConfig, TFBertForMaskedLM


class Bert(tf.keras.Model):
  def __init__(self):
    super(Bert, self).__init__()
    self.vocab_size = 30000
    self.encoder = TFBertForMaskedLM(BertConfig(vocab_size=30000))
    self.optimizer = tf.keras.optimizers.Adam(0.0001)

  def train_step(self, inputs):
    loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    with tf.GradientTape() as tape:
        predictions, _ = self(inputs)
        predictions_mlm = predictions.logits
        # inputs[0] == encodings; nothing is masked, so the labels are the
        # inputs themselves, one-hot over the vocab -> (batch, seq, vocab)
        mlm_labels = tf.one_hot(inputs[0], self.vocab_size, axis=2)
        # per-position losses (reduction=NONE), then averaged over every
        # position, padding included
        loss_mlm = loss_fn(mlm_labels, predictions_mlm)
        loss = tf.cast(tf.reduce_mean(loss_mlm), tf.float32)

    trainable_vars = self.encoder.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    self.optimizer.apply_gradients(zip(gradients, trainable_vars))

    return loss

  def call(self, inputs):
    encodings, labels = inputs
    return self.encoder(encodings, training=True), labels
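For reference, the `tf.one_hot(inputs[0], self.vocab_size, axis=2)` call above just expands each token id into a one-hot vector over the vocabulary (with a `(batch, seq)` input, `axis=2` is equivalent to the default `axis=-1`). A plain-Python sketch of that step, with a tiny vocabulary size for readability:

```python
def one_hot(token_ids, depth):
    """Expand token-id sequences of shape (batch, seq) into
    one-hot vectors of shape (batch, seq, depth)."""
    return [
        [[1.0 if v == t else 0.0 for v in range(depth)] for t in row]
        for row in token_ids
    ]

batch = [[1, 0, 2]]  # one sequence of three token ids
print(one_hot(batch, 4))
# -> [[[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]]
```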

No matter what the input is, the model always predicts the output tokens to be 0 (which is ‘[PAD]’).
I have also tried a range of different learning rates. The result is the same.
I also tried passing in random input instead of the encodings, while keeping the original encodings as labels, and the behavior was identical. This shows that the model is completely ignoring the input and looking only at the labels. Since [PAD] is the most common token in the labels, it always outputs [PAD]. Is there something I am missing? Any help is much appreciated.
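To make this "majority label" effect concrete, here is a toy cross-entropy calculation (the padding fraction and probabilities are assumptions, not numbers measured from my run): when the loss is averaged over every position, padding included, a constant predictor that always bets on [PAD] already scores far better than a uniform one, so nothing forces the model to look at the inputs.

```python
import math

vocab_size = 30000
pad_fraction = 0.8          # assumed share of [PAD] positions in the padded labels

# Constant predictor: probability 0.9 on [PAD], the rest spread uniformly.
p_pad = 0.9
p_other = 0.1 / (vocab_size - 1)
loss_always_pad = -(pad_fraction * math.log(p_pad)
                    + (1 - pad_fraction) * math.log(p_other))

# Uniform predictor over the whole vocabulary, for comparison.
loss_uniform = -math.log(1 / vocab_size)

# Always predicting [PAD] gives a much lower mean loss than the uniform
# baseline, without using the inputs at all.
print(loss_always_pad, loss_uniform)
```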