Getting the MLM accuracy for the BERT model I am training from scratch

I am training a BertForMaskedLM model from scratch.
This is my tokenizer (previously trained):

tokenizer = BertTokenizer('vocab.txt')

This is my config:

config = BertConfig(

This is how I load the model from the last checkpoint:

model = BertForMaskedLM.from_pretrained("/BERT/bert-checkpoints/checkpoint-1726500",config=config)

My data collator:

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
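
For context on what this collator hands to the model: DataCollatorForLanguageModeling sets the label to -100 at every position it did not select for masking, so only the selected positions count toward the loss. Below is a minimal NumPy sketch of that labeling idea, not the library's implementation. It is simplified on purpose: it always substitutes the mask token (the real collator also sometimes keeps the token or swaps in a random one, and it skips special tokens). `simple_mlm_mask` and `MASK_TOKEN_ID = 4` are hypothetical names and values chosen for illustration.

```python
import numpy as np

IGNORE_INDEX = -100   # label value that the MLM loss skips
MASK_TOKEN_ID = 4     # hypothetical id for [MASK]; the real id depends on vocab.txt


def simple_mlm_mask(input_ids, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = np.random.default_rng(seed)
    input_ids = np.asarray(input_ids)
    # Select each position independently with probability mlm_probability.
    picked = rng.random(input_ids.shape) < mlm_probability
    # Labels keep the original id at selected positions, IGNORE_INDEX elsewhere.
    labels = np.where(picked, input_ids, IGNORE_INDEX)
    # Simplification: always put the mask token at selected positions.
    masked_inputs = np.where(picked, MASK_TOKEN_ID, input_ids)
    return masked_inputs, labels


example = [2, 507, 157, 3656, 117, 2100, 521, 122, 280, 3]
masked_inputs, labels = simple_mlm_mask(example, mlm_probability=0.5)
```

The practical consequence is that any accuracy computed from these labels should look only at positions where the label is not -100.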

The compute metrics function:

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # compute() already returns a dict; wrapping it in {} would build a set instead
    return metric.compute(predictions=predictions, references=labels)

The training arguments:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    evaluation_strategy = 'steps'

trainer = Trainer(

And finally, I use this to train the model (which works fine):


However, while the model is training, I only see three metrics: Step, Training Loss, and Validation Loss. I also want to see the accuracy of the masked language model (MLM accuracy). How should I do that? Note that I have already defined the “compute_metrics” function, which computes the “accuracy”, but the accuracy is not shown and I do not know what is wrong.

Note: by the way, my dataset is an instance of a Dataset object that has a member called “examples”. For instance, dataset.examples[0] is [2, 507, 157, 3656, 117, 2100, 521, 122, 280, 3]

How many total steps are there in your training? Since you chose the "steps" strategy, I wonder if it’s just because evaluation is never run?
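
To make the concern concrete: with the "steps" strategy, evaluation runs once every eval_steps update steps (as far as I know, that value falls back to logging_steps when it is not set). The sketch below is plain arithmetic, not Trainer code. `evaluation_steps` is a hypothetical helper and the step counts are illustrative, not taken from this run.

```python
def evaluation_steps(total_steps, eval_steps):
    """Global steps at which a step-based schedule would trigger evaluation."""
    return list(range(eval_steps, total_steps + 1, eval_steps))


# 1200 total steps with an evaluation every 500 steps (illustrative numbers).
few = evaluation_steps(total_steps=1200, eval_steps=500)

# A run shorter than the interval never reaches an evaluation step.
none = evaluation_steps(total_steps=300, eval_steps=500)
```

If the second case applied, no validation metrics of any kind would be logged during training.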

I think the total number of steps is “1963915”, if I’m not wrong. I attached a picture relating to it. Please take a look.

Note: this is the picture for when I do not specify the “evaluation strategy”. If I specify that, then the validation loss will also appear. (But again, not the accuracy).

It is also worth mentioning that the model really is being trained: when I load my model and use the fill_mask function, the masked tokens are predicted very well. It’s just that I want to report the “accuracy”.

Are you sure your evaluation dataset contains labels? What’s the output of trainer.predict applied to your evaluation dataset?

After running trainer.predict(dataset), the output is:

PredictionOutput(predictions=None, label_ids=None, metrics={'test_loss': 1.8579237461090088, 'test_runtime': 0.1819, 'test_samples_per_second': 549.649, 'test_steps_per_second': 21.986})

So, we know that my dataset should have labels for the accuracy to be measured. My question is: How should I add the labels?

It’s not that you don’t have labels, it’s that you don’t have anything: predictions is also None. My guess would be that there are not enough samples in your dataset to form a batch.

It’s giving this error once the predictions and labels are passed in:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
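
That error is consistent with handing 2-D arrays to an accuracy metric that expects one integer per example: for a masked LM the logits have shape (batch, seq_len, vocab_size), so the argmax is still (batch, seq_len), and each "example" arrives as a list. A minimal sketch of one way around it, assuming the labels use -100 at unmasked positions (which is what DataCollatorForLanguageModeling produces): select only the masked positions, which also flattens the arrays, and compute the accuracy directly with NumPy. This is an illustration under those assumptions, not the thread author's final code.

```python
import numpy as np

IGNORE_INDEX = -100  # positions the MLM loss (and this metric) should skip


def compute_metrics(eval_pred):
    """MLM accuracy computed over masked positions only."""
    logits, labels = eval_pred
    logits = np.asarray(logits)            # (batch, seq_len, vocab_size)
    labels = np.asarray(labels)            # (batch, seq_len)
    predictions = logits.argmax(axis=-1)   # (batch, seq_len)
    keep = labels != IGNORE_INDEX          # True where a token was masked
    if not keep.any():
        return {"accuracy": float("nan")}
    # Boolean indexing yields 1-D arrays, so no nested lists are involved.
    correct = (predictions[keep] == labels[keep]).sum()
    return {"accuracy": float(correct) / int(keep.sum())}


# Toy check: one sequence of three tokens over a vocabulary of size three.
toy_logits = np.array([[[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]])
toy_labels = np.array([[1, -100, 0]])
result = compute_metrics((toy_logits, toy_labels))
```

In the toy check, the two scored positions have one correct prediction, and the middle position is ignored. For a large vocabulary the full logits array can be very large; recent versions of Trainer accept a preprocess_logits_for_metrics argument that can reduce the logits to argmax ids before they are accumulated.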