How to correctly evaluate a Masked Language Model?

Hi All,

My question is very simple. Starting from a pre-trained (Italian) model, I fine-tuned it on a specific domain of interest, say X, using masked language model (MLM) training. I then computed perplexity on a test text from domain X and checked that the fine-tuned model performs better than the pre-trained one.
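For reference, this is roughly what I mean by perplexity here: the exponential of the mean cross-entropy over the masked positions only, which should match exp(eval_loss) when the loss comes from an MLM head. The sketch below is NumPy-only; the function name masked_lm_perplexity and the ignore_index argument are my own choices for illustration, and I am assuming unmasked positions carry the label -100.

```python
import numpy as np

def masked_lm_perplexity(logits, labels, ignore_index=-100):
    """Perplexity over masked positions only (illustrative helper).

    logits: float array of shape (batch, seq_len, vocab_size)
    labels: int array of shape (batch, seq_len); ignore_index marks
            positions that were not masked and must not be scored
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)

    mask = labels != ignore_index
    if not mask.any():
        raise ValueError("no masked positions to score")

    selected = logits[mask]   # (n_masked, vocab_size)
    targets = labels[mask]    # (n_masked,)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = selected - selected.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of the true token at each masked position.
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))
```

Since the masking is normally random, the comparison between the pre-trained and the fine-tuned model is only fair if both are scored on the same masked positions, for example by fixing the seed or by masking the test set once and reusing it.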

Is that sufficient?

The people who trained the language model Umberto with MLM also ran several tests on downstream tasks such as NER and POS tagging.

How did they do it?

My understanding is that, starting from Umberto, they fine-tuned it on NER and tested it on WikiNER-ITA, for instance. Is there a simple procedure for doing this?
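As far as I can tell, the fine-tuning step itself would put a token-classification head on top of the model, and the test step would then compare predicted and gold tag sequences. The scoring half can be sketched without any model at all. The code below assumes BIO-style tags (B-PER, I-PER, O, ...) and entity-level scoring, where a predicted entity counts only if its type and span both match exactly. Both are assumptions on my side, as are the helper names extract_entities and entity_f1; a package such as seqeval computes this kind of score in practice.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sentence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        # A B- tag always opens a span; a stray I- tag opens one leniently.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged entity-level precision, recall and F1."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold_sentences, pred_sentences):
        gold = extract_entities(gold_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)
        n_gold += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
scores = entity_f1(gold, pred)  # one of the two gold entities is found
```

In this toy case the model finds the person but misses the location, so precision is 1.0 and recall is 0.5.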


In RoBERTa they report accuracy and F1 scores for the language model. I got this code, which I think computes the accuracy:

import numpy as np
from datasets import load_metric  # the "accuracy" metric needs scikit-learn installed

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Score only the masked positions; every other label is set to -100.
    mask = labels != -100

    results = metric.compute(predictions=predictions[mask], references=labels[mask])
    results["eval_accuracy"] = results["accuracy"]

    return results

Then create a trainer and pass this function as the compute_metrics parameter:

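from transformers import Trainer
# model, training_args and eval_dataset stand in for your own objects (not shown in the post)
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

results = trainer.evaluate()
accuracy = results["eval_accuracy"]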