Metrics for masked language modeling (MLM)

I want to compare the fit of my BERT model before and after running MLM training for a few epochs on my own textual data.

According to "Perplexity of fixed-length models" in the transformers 4.10.1 documentation, perplexity isn't well defined for MLM. Which metric should I use instead? So far I have calculated accuracy on the masked tokens only (comparing the actual labels at the masked positions with the token the model predicts for each of those positions). This is obviously not a great metric, since it punishes a synonym just as harshly as any other wrong prediction.
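
For reference, here is a minimal sketch of how I compute that masked-token accuracy (the model name and example sentence are just placeholders; I use `DataCollatorForLanguageModeling` to apply the 15% masking):

```python
import torch
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Placeholder checkpoint; in practice this would be the model
# before/after my own MLM training.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# The collator masks ~15% of tokens and sets labels to -100
# at all positions that were not selected for masking.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

texts = ["The quick brown fox jumps over the lazy dog."]  # placeholder data
encodings = tokenizer(texts, truncation=True, return_tensors="pt")
batch = collator([{k: v.squeeze(0) for k, v in encodings.items()}])

with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )

preds = outputs.logits.argmax(dim=-1)
labels = batch["labels"]
mask = labels != -100  # only masked positions count toward the metric

accuracy = (preds[mask] == labels[mask]).float().mean().item()
print(f"masked-token accuracy: {accuracy:.4f}")
```

As noted above, a synonym predicted at a masked position scores exactly zero here, the same as a completely unrelated token, which is why I'm looking for a better alternative.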