Evaluation metrics for BERT-like LMs

Hey guys,

I’ve read that perplexity (PPL) is one of the most common metrics for evaluating autoregressive (causal) language models. But what do we use for masked language models (MLMs) like BERT?

I need to evaluate BERT models after pre-training and compare them to existing BERT models without running GLUE-like downstream benchmarks.


I found an interesting project, https://github.com/awslabs/mlm-scoring, which seems to be a step in the right direction. The authors also published a paper: https://arxiv.org/pdf/1910.14659v2.pdf
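As far as I understand the paper, it scores a sentence by its pseudo-log-likelihood (PLL): mask each token in turn, sum the log-probabilities the model assigns to the true tokens, then exponentiate the negative per-token average over a corpus to get a pseudo-perplexity (PPPL). Here is a minimal sketch of that computation. It is not the mlm-scoring API: `masked_log_prob` is a hypothetical callable you would have to implement around your own model.

```python
import math
from typing import Callable, Iterable, Sequence

# Hypothetical interface (an assumption, not part of any library):
# given a tokenized sentence and a position i, return
# log P(tokens[i] | all other tokens) with position i masked out.
MaskedLogProb = Callable[[Sequence[str], int], float]


def pseudo_log_likelihood(tokens: Sequence[str], masked_log_prob: MaskedLogProb) -> float:
    """Sum of per-token log-probabilities, masking one position at a time."""
    return sum(masked_log_prob(tokens, i) for i in range(len(tokens)))


def pseudo_perplexity(corpus: Iterable[Sequence[str]], masked_log_prob: MaskedLogProb) -> float:
    """exp of the negative average PLL per token over the whole corpus."""
    total_pll = 0.0
    total_tokens = 0
    for tokens in corpus:
        total_pll += pseudo_log_likelihood(tokens, masked_log_prob)
        total_tokens += len(tokens)
    if total_tokens == 0:
        raise ValueError("corpus contains no tokens")
    return math.exp(-total_pll / total_tokens)
```

Lower is better, as with ordinary perplexity. Note that it costs one forward pass per token, and I would only compare values between models that share the same tokenizer.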


Hi Vladimir,

Before releasing new models, I usually evaluate multiple checkpoints on at least two downstream tasks (normally PoS tagging or NER).

But maybe you can also evaluate the MLM capability of some checkpoints, as shown in the following paper:

I would use the “Cloze test word prediction” task. It masks out some subwords from an input sentence, tries to reconstruct the masked subwords and calculates accuracy. With that task you could at least measure the MLM capability of your checkpoints, without the extensive hyper-parameter search and multiple runs that downstream tasks require.
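Roughly, such a cloze evaluation could look like the sketch below. This is not the code from the paper. `predict_masked` is a hypothetical callable wrapping your checkpoint, and the `[MASK]` token and the 15% masking rate are assumptions borrowed from BERT's pre-training setup.

```python
import random
from typing import Callable, Iterable, List, Sequence

# Hypothetical interface (an assumption): receives the masked token list and
# the masked positions, returns one predicted token per masked position.
PredictMasked = Callable[[List[str], List[int]], List[str]]


def cloze_accuracy(
    sentences: Iterable[Sequence[str]],
    predict_masked: PredictMasked,
    mask_token: str = "[MASK]",  # assumed BERT-style mask token
    mask_prob: float = 0.15,     # assumed masking rate
    seed: int = 0,
) -> float:
    """Fraction of randomly masked subwords that the model reconstructs exactly."""
    rng = random.Random(seed)  # fixed seed so every checkpoint sees the same masks
    correct = 0
    total = 0
    for tokens in sentences:
        positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
        if not positions:
            continue  # nothing was masked in this sentence
        masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
        predictions = predict_masked(masked, positions)
        for pos, pred in zip(positions, predictions):
            correct += int(pred == tokens[pos])
            total += 1
    if total == 0:
        raise ValueError("no tokens were masked; use more data or a higher mask_prob")
    return correct / total
```

Keeping the seed fixed means all checkpoints are scored on identical masked positions, so the accuracies stay comparable across checkpoints.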

Thanks a lot, @stefan-it! I see the project is using the old HF naming scheme, but it shouldn’t be hard to update.