Getting the log probability of a sentence with BERT

Hi all,

I recently came across LM-Critic, whose main idea is to use a language model's probabilities to assess the grammaticality of two similar sentences. Since LM-Critic uses the Hugging Face GPT2LMHeadModel, I decided to experiment with BertLMHeadModel instead, but the results are much worse (~60%) than those of GPT2 (~90%).

Without going deeper into the details of my comparison (I’m planning to share a link with the code soon), I was wondering if the reason behind BERT’s poor performance in this task could be explained by the different training objectives of these two models.
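
To make the question concrete, here is how I understand the difference (a rough sketch, not the code from my comparison). GPT-2 is trained left to right, so summing log p(w_i | w_1..w_{i-1}) over the tokens gives a proper sentence log probability. BERT is trained to predict masked tokens from context on both sides, so it has no such factorization. One common substitute is a pseudo-log-likelihood: mask each token in turn and sum the log probabilities the model assigns to the true tokens. In the Python sketch below, `masked_log_probs` and `uniform_model` are hypothetical stand-ins for a real masked LM call, used only so the example runs without downloading a model.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption for this sketch): given the token ids
# with one position replaced by the mask id, plus that position, return the
# log-probabilities over the whole vocabulary at that position.
MaskedLogProbs = Callable[[Sequence[int], int], Sequence[float]]


def pseudo_log_likelihood(
    token_ids: Sequence[int],
    mask_id: int,
    masked_log_probs: MaskedLogProbs,
) -> float:
    """Sum of log p(token_i | all other tokens), masking one token at a time."""
    total = 0.0
    for i, true_id in enumerate(token_ids):
        masked: List[int] = list(token_ids)
        masked[i] = mask_id                      # hide only position i
        log_probs = masked_log_probs(masked, i)  # one model call per token
        total += log_probs[true_id]              # score the original token
    return total


def uniform_model(vocab_size: int) -> MaskedLogProbs:
    """Toy stand-in for a masked LM: every token is equally likely."""
    def fn(masked_ids: Sequence[int], position: int) -> Sequence[float]:
        return [math.log(1.0 / vocab_size)] * vocab_size
    return fn


if __name__ == "__main__":
    # Three made-up token ids, a made-up mask id of 0, vocabulary of 10.
    score = pseudo_log_likelihood([5, 2, 7], mask_id=0,
                                  masked_log_probs=uniform_model(10))
    print(score)
```

With a real model, `masked_log_probs` would wrap a forward pass of a masked LM such as BertForMaskedLM followed by a log-softmax at the masked position, and special tokens like [CLS] and [SEP] would be skipped. Note that this needs one forward pass per token, and the result is a score rather than a true probability, so it isn't directly comparable to GPT-2's numbers.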

I’d be happy to hear your thoughts.