How to correctly evaluate a Masked Language Model?

Hi All,

My question is fairly simple. Starting from a pre-trained (Italian) model, I fine-tuned it on a specific domain of interest, say X, using masked language model (MLM) training. I then computed the perplexity on a test text from domain X and checked that the fine-tuned model performs better than the pre-trained one.
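For concreteness, here is a minimal sketch of the kind of setup I mean, taking perplexity as the exponential of the evaluation loss on masked tokens (the model path and the test-set variable are placeholders, not real names):

import math
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholders: substitute your own fine-tuned checkpoint and tokenized test set.
model = AutoModelForMaskedLM.from_pretrained("path/to/finetuned-italian-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-italian-model")

# Randomly masks 15% of tokens at evaluation time, as in standard MLM training,
# so the reported perplexity varies slightly from run to run.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_eval", per_device_eval_batch_size=8),
    data_collator=data_collator,
    eval_dataset=tokenized_test_set,  # your tokenized domain-X test text
)

eval_loss = trainer.evaluate()["eval_loss"]  # mean cross-entropy over masked tokens
perplexity = math.exp(eval_loss)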

Is this sufficient?

The people who trained the Umberto language model with MLM training ran several evaluations on downstream tasks such as NER and POS tagging.

How did they do that?

My understanding is that, starting from Umberto, they fine-tuned it on NER and tested it on WikiNER-ITA, for instance. Is there a simple procedure to do this?

Thanks!

In the RoBERTa paper they report the accuracy and F1 scores of the language model. I got this code, which I think computes the accuracy:

import numpy as np
from datasets import load_metric  # the "accuracy" metric needs scikit-learn installed

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Only the masked positions carry a real label; every other position
    # is set to -100 and must be excluded from the accuracy computation.
    mask = labels != -100
    results = metric.compute(predictions=predictions[mask], references=labels[mask])

    # Rename the key so the Trainer reports it as "eval_accuracy".
    results["eval_accuracy"] = results.pop("accuracy")
    return results

Then create a Trainer and pass this function as the compute_metrics parameter:

from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_eval"),
    eval_dataset=eval_dataset,        # your tokenized evaluation set
    data_collator=data_collator,      # e.g., a DataCollatorForLanguageModeling
    compute_metrics=compute_metrics,
)

Then run the evaluation:

results = trainer.evaluate()
accuracy = results["eval_accuracy"]  # the key set inside compute_metrics
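As for the NER part of the question: yes, the procedure is fairly mechanical with the token-classification tooling in transformers. Below is a rough sketch; the Hub identifiers for the UmBERTo checkpoint and the WikiNER-ITA dataset are illustrative and should be replaced with the actual ones, and the "tokens"/"ner_tags" column names follow the common token-classification layout, which may differ for this dataset:

from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Illustrative identifiers: substitute the real Hub ids.
dataset = load_dataset("path/or/hub-id-for-wikiner-ita")
checkpoint = "Musixmatch/umberto-commoncrawl-cased-v1"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
label_names = dataset["train"].features["ner_tags"].feature.names
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(label_names))

def tokenize_and_align(batch):
    # Tokenize pre-split words and align word-level NER tags to subwords:
    # the first subword of each word keeps the tag, the rest get -100.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        labels, prev = [], None
        for wid in enc.word_ids(batch_index=i):
            if wid is None or wid == prev:
                labels.append(-100)
            else:
                labels.append(tags[wid])
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner_out", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

Entity-level precision/recall/F1 are then usually computed with the seqeval metric on the evaluation predictions.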

Very much late to the party here, but I have been struggling to compute perplexity with my BERT model. How exactly did you determine the perplexity of yours?

@NHendrickson9616 Perplexity is really only well-suited to causal/autoregressive language models such as GPT. Since it only uses information from the tokens on the left-hand side, it wouldn't provide very meaningful feedback for a model like BERT, which leverages bidirectional context.
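That said, if you still want a perplexity-like number for an MLM, one option (not discussed above; see Salazar et al., "Masked Language Model Scoring") is pseudo-perplexity: mask each token in turn, score it with the model, and exponentiate the average negative log-likelihood. A minimal sketch, assuming model and tokenizer are a masked-LM checkpoint and its tokenizer:

import math
import torch

def pseudo_perplexity(text, model, tokenizer):
    # Mask each position in turn and collect the negative log-likelihood
    # the model assigns to the true token; this needs one forward pass
    # per token, so it is expensive on long texts.
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nlls = []
    for i in range(1, len(input_ids) - 1):  # skip the special tokens at the ends
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))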
