Using BERT and RoBERTa for (causal?) language modeling

Hi all!

I’m hoping to use pretrained BERT/RoBERTa for language modeling, i.e. scoring the likelihood of sentences. There have been quite a few blog posts/issues on this, but no obvious consensus yet. I attempted an implementation with BertLMHeadModel and RobertaForCausalLM here, but ran into some odd issues:

  • RoBERTa produces extremely large perplexity values, and
  • BERT cannot correctly compare the relative perplexities of simple sentences.
    (Please see the GitHub issue above for more details.)

@gugarosa kindly suggested that I shouldn’t evaluate pretrained BERT/RoBERTa directly, but should first train them with a causal LM objective. However, given the size of their pretraining data, it’s infeasible for me to retrain them myself. Are there any existing checkpoints that I can use directly? Or would you recommend other models (e.g. BertForMaskedLM?) or evaluation metrics (other than perplexity) instead?
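For context, here is roughly what my scoring code looks like (a minimal sketch rather than the exact code from the issue; the example sentences are made up):

```python
# Minimal sketch of scoring a sentence's perplexity with BERT used as a
# left-to-right (causal) LM; RobertaForCausalLM would be used analogously.
import torch
from transformers import BertLMHeadModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# is_decoder=True gives the model a causal attention mask
model = BertLMHeadModel.from_pretrained("bert-base-uncased", is_decoder=True)
model.eval()

def perplexity(sentence: str) -> float:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the (shifted) cross-entropy loss
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The cat sat on the mat."))
print(perplexity("The mat sat on the cat."))
```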

Thanks in advance!

Continuing our discussion on GitHub…

You are definitely correct that it might be infeasible to train from scratch as they initially did, especially given the size of the pretraining data.

On the other hand, imagine that you have a pre-trained BERT/RoBERTa model and you attach the LM head on top of it. You could freeze the pre-trained parameters of the initial BERT, or give that part of the architecture a small learning rate, while you fine-tune the LM head with a more aggressive rate on your own data using a causal language modeling (CLM) objective. The idea is to adapt the pre-trained BERT to the CLM task directly on your data, without losing the features it has already learned during pre-training. Nonetheless, it is just an initial thought and I do not know how it would work in the real world, as my experience is mostly with autoregressive models such as GPT and Transformer-XL.
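Something along these lines, as a rough, untested sketch (the attribute names are the ones BertLMHeadModel uses in transformers; the learning rates are arbitrary):

```python
# Untested sketch: keep the pre-trained BERT body frozen (or on a tiny learning
# rate) and train the LM head more aggressively with a causal LM objective.
import torch
from transformers import BertLMHeadModel

model = BertLMHeadModel.from_pretrained("bert-base-uncased", is_decoder=True)

# Option 1: freeze the pre-trained encoder entirely
# for param in model.bert.parameters():
#     param.requires_grad = False

# Option 2: discriminative learning rates (small for the body, larger for the head)
optimizer = torch.optim.AdamW([
    {"params": model.bert.parameters(), "lr": 1e-6},  # pre-trained BERT body
    {"params": model.cls.parameters(), "lr": 1e-4},   # LM head, trained on your data
])
# ...then run a standard CLM training loop with input_ids as labels.
```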

Regarding pre-trained models for language generation / CLM, there are a few that I could find by filtering the Hub with the text-generation and bert tags: Models - Hugging Face. However, I cannot say for sure whether they were trained with masked LM or CLM, as there were no model cards with descriptions.

Regarding the evaluation metric, it is certainly a challenge to define an appropriate one, or even to just rely on the loss/perplexity. The problem with loss and perplexity is that they might mislead us when comparing models with close values, because they rely strictly on the conditional probability of a token given the previous tokens: essentially we are trying to match a specific target exactly, even though that target could still be valid with some variations.

For example:

A sample in the test set such as “Hello, how are you” might get a rather different perplexity from generated outputs like “Hello, how you doing” or “Hello, how it is going”, even though they have similar meanings, semantically speaking.
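Just to illustrate the point, a quick sketch (GPT-2 is only used as an example autoregressive LM here):

```python
# Sketch: sentences with similar meanings can still get quite different perplexities.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

for text in ["Hello, how are you", "Hello, how you doing", "Hello, how it is going"]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(text, "-> perplexity:", torch.exp(loss).item())
```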

I have seen some works that attempt to employ an exact-match or even a partial-match metric, correlating the n-grams of a generated text with a reference (test sample), in the same way that BLEU, METEOR and ROUGE are applied to machine translation. A qualitative assessment is also pretty interesting, especially if the model is going to be deployed into a real-world application or something similar. Unfortunately, we are still lacking advances on how to turn grammar, syntax and semantics into proper quantitative metrics, but that might change in the near future… at least I hope so!
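For example, a BLEU-style n-gram overlap between a generated text and a reference could look like this (a sketch using NLTK; METEOR and ROUGE would be used similarly):

```python
# Sketch of an n-gram overlap check between a generated text and a reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Hello, how are you".split()
candidate = "Hello, how you doing".split()

score = sentence_bleu(
    [reference],   # list of reference token lists
    candidate,     # generated tokens
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
)
print("BLEU:", score)
```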

Some things aren’t clear to me from reading your initial post.

  • what is your end goal? Why do you want to score the likelihood of sentences?
  • why are you set on using an MLM model like BERT/RoBERTa and training it yourself for an autoregressive problem? Why not use a pretrained GPT-2 or the like for this?

The CoLA dataset (Corpus of Linguistic Acceptability) tests for grammaticality/acceptability, which may be what you are after.
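It is available as part of GLUE through the datasets library, for instance:

```python
# Load CoLA (part of the GLUE benchmark) via the datasets library
from datasets import load_dataset

cola = load_dataset("glue", "cola")
print(cola["train"][0])  # fields: 'sentence', 'label' (0 = unacceptable, 1 = acceptable), 'idx'
```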


@gugarosa Thanks so much for your detailed response! Those are all valuable pointers and insights; I will definitely take a closer look. I also noticed this work on MLM scoring, which might be a viable alternative.
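For anyone reading along, the core idea of that MLM-scoring approach (pseudo-log-likelihood) is roughly the following; this is just my own minimal sketch, not the authors’ implementation:

```python
# Sketch of pseudo-log-likelihood scoring: mask each token in turn and sum the
# log-probability BERT assigns to the original token at that position.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the [CLS] (first) and [SEP] (last) special tokens
    for i in range(1, len(input_ids) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[input_ids[i]].item()
    return total

print(pseudo_log_likelihood("The cat sat on the mat."))
print(pseudo_log_likelihood("The mat sat on the cat."))
```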


Hi @BramVanroy, thanks for your reply! Regarding your questions:

  • My end goal is to score the likelihood of sentences using BERT/RoBERTa on my custom dataset. My dataset can be thought of as a semantic version of CoLA, where models decide which sentences are semantically likely vs. unlikely given a previous prompt.
  • I’m not really set on the CLM objective; I’m just interested in seeing how all kinds of language models (including BERT, RoBERTa, GPT…) perform on my dataset. I have already obtained results with GPT-2 without any problems, and I hope to get results from MLM-based models as well.
  • CoLA has a training set, but my dataset unfortunately doesn’t – that’s why I’m hoping to evaluate an off-the-shelf model.

So you are not necessarily interested in the likelihood of a sentence in itself, but in its likelihood given a previous prompt/sentence? In that case, the NLI (natural language inference) benchmarks might be related (though not exactly what you want). You can find fine-tuned models on the Hub.
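As an illustration (roberta-large-mnli is just one example of a fine-tuned NLI checkpoint on the Hub; the sentences are made up):

```python
# Sketch: use an NLI model to judge how plausible a sentence is given a prompt.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # example NLI checkpoint; any MNLI model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "She poured the coffee."
for hypothesis in ["The cup is now full.", "The cup started to sing."]:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Label names are taken from the checkpoint's own config
    scores = {model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(probs)}
    print(hypothesis, "->", scores)
```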

But to stick to your approach: I would not make use of MLM models for this use case. I don’t see how they relate, since they are not generative (decoder) models. More interesting (imo) would be to compare GPT-2 with GPT-3 (you can request beta API access) or GPT-J in this context.

Got it, thanks a lot for your help!