Getting the MLM accuracy for the BERT model I am training from scratch

Hi
I am training a BertForMaskedLM model from scratch.
This is my tokenizer (previously trained):

tokenizer = BertTokenizer('vocab.txt')

This is my config:

config = BertConfig(
    vocab_size=20000,
    max_position_embeddings=258
)

This is how I load the model from the last checkpoint:

model = BertForMaskedLM.from_pretrained("/BERT/bert-checkpoints/checkpoint-1726500", config=config)

My data collator:

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

The compute metrics function:


import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # metric.compute already returns a dict, so return it directly
    return metric.compute(predictions=predictions, references=labels)

The training arguments:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-checkpoints",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    evaluation_strategy="steps"
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    eval_dataset=dataset,
    compute_metrics=compute_metrics
)

And finally, I use this to train the model (which works fine):

trainer.train()

However, while the model trains I only see three metrics: Step, Training Loss, and Validation Loss. I also want to see the accuracy of the masked language model (MLM accuracy). How should I do that? Note that I have already defined a compute_metrics function that returns the accuracy and passed it to the Trainer, but the accuracy is not shown and I do not know what is wrong.

Note: my dataset is an instance of torch.utils.data.Dataset with a member called “examples”. For instance, dataset.examples[0] is [2, 507, 157, 3656, 117, 2100, 521, 122, 280, 3]

How many total steps are there in your training? Since you chose the "steps" strategy, I wonder if it’s just because evaluation is never run?

I think there are 1963915 total steps, if I’m not wrong. I attached a picture showing this; please take a look.

Note: this is the picture for when I do not specify the “evaluation strategy”. If I specify that, then the validation loss will also appear. (But again, not the accuracy).

It is also worth mentioning that the model really is being trained: when I load my model and use the fill_mask function, the masked tokens are predicted very well. I just want to report the accuracy.
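On the question of whether evaluation ever runs, a quick back-of-the-envelope check may help. The sketch below assumes eval_steps is left unset, in which case it falls back to logging_steps (500 by default), and takes the step count quoted above at face value.

```python
# Rough check of how often evaluation would trigger with the "steps" strategy.
total_steps = 1_963_915  # figure quoted in this thread
eval_steps = 500         # assumption: eval_steps unset, so it falls back to logging_steps (default 500)

# Evaluation runs at steps 500, 1000, 1500, ...
num_evaluations = total_steps // eval_steps
first_eval_step = eval_steps

print(f"first evaluation at step {first_eval_step}, {num_evaluations} evaluations in total")
```

Under those assumptions evaluation is triggered thousands of times, which matches the validation loss showing up, so the missing accuracy probably has another cause.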

Are you sure your evaluation dataset contains labels? What’s the output of trainer.predict applied to your evaluation dataset?

After running trainer.predict(dataset) , the output is:

PredictionOutput(predictions=None, label_ids=None, metrics={'test_loss': 1.8579237461090088, 'test_runtime': 0.1819, 'test_samples_per_second': 549.649, 'test_steps_per_second': 21.986})

So, we know that my dataset should have labels for the accuracy to be measured. My question is: How should I add the labels?

It’s not that you don’t have labels, it’s that you don’t have anything: predictions is also None. My guess would be that there are not enough samples in your dataset to form a batch.
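Another thing that may be worth checking is prediction_loss_only=True in the training arguments. According to the TrainingArguments documentation, this makes evaluation and prediction return only the loss, which would explain why both predictions and label_ids are None and why compute_metrics is never called. Below is a minimal sketch of the changed arguments, assuming everything else stays as posted. The plain dict and the name training_kwargs are only there so the snippet runs without transformers installed, and eval_steps=500 is an assumed cadence.

```python
# Keyword arguments mirroring the TrainingArguments from the question.
# Hypothetical helper dict: pass it on with TrainingArguments(**training_kwargs).
training_kwargs = dict(
    output_dir="./bert-checkpoints",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=False,  # was True, which makes evaluation return the loss only
    evaluation_strategy="steps",
    eval_steps=500,              # assumed cadence, stated explicitly instead of relying on the default
)
```

Keep in mind that with this change the Trainer collects the logits for the whole evaluation set before calling compute_metrics. With a 20000-token vocabulary that can take a lot of memory, so a small held-out evaluation set is the safer choice.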

Once the predictions and labels are returned, it gives this error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
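That error would be consistent with the accuracy metric receiving 2-D inputs: for MLM, both the argmax predictions and the labels have shape (batch, sequence length), while the metric expects flat sequences of integers. I cannot confirm this without the full traceback. In addition, DataCollatorForLanguageModeling sets the label of every unmasked position to -100, so those positions should be left out of the accuracy. Here is a minimal sketch in plain NumPy that handles both points. The helper name mlm_accuracy and the toy data are made up for illustration.

```python
import numpy as np

IGNORE_INDEX = -100  # label the MLM data collator assigns to positions that were not masked


def mlm_accuracy(logits, labels):
    """Fraction of masked positions where the top-scoring token equals the label."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    predictions = np.argmax(logits, axis=-1)  # shape (batch, seq_len)
    mask = labels != IGNORE_INDEX             # True only at masked positions
    if not mask.any():
        return float("nan")                   # nothing to score
    # Boolean indexing flattens both arrays to 1-D before comparing.
    return float((predictions[mask] == labels[mask]).mean())


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": mlm_accuracy(logits, labels)}


# Toy check: batch of 1, sequence length 4, vocabulary of 3 tokens.
toy_logits = np.array([[[0.1, 0.8, 0.1],
                        [0.7, 0.2, 0.1],
                        [0.2, 0.2, 0.6],
                        [0.3, 0.4, 0.3]]])
toy_labels = np.array([[-100, 0, 1, -100]])  # only positions 1 and 2 were masked
result = compute_metrics((toy_logits, toy_labels))
print(result)  # {'accuracy': 0.5}
```

In the toy example only two positions count, and the model gets one of them right, hence 0.5.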

Hi, @arm-on. I recently did the same thing you did: pre-training on the MLM task from scratch with my own domain data. I noticed you said your model performs pretty well. Do you still remember the final MLM loss of your model? My model's loss is currently around 4.5, and I’m trying to find a way to evaluate it. Thank you so much!