Continuing our discussion on GitHub…
You are definitely correct that it might be infeasible to train from scratch as they initially did, especially given the size of your data.
On the other hand, imagine that you take a pre-trained BERT/RoBERTa model and attach an LM head on top of it. You could freeze the pre-trained parameters of the original BERT, or assign them a small learning rate, while fine-tuning the LM head with a more aggressive rate on your own data using a causal language modeling (CLM) objective. The idea would be to adapt the pre-trained BERT to the CLM task directly on your data, without losing the features it has already learned from its previous training. Nonetheless, it is just an initial thought and I do not know how it would work in the "real world", as my experience is mostly with autoregressive models, such as GPT and Transformer-XL.
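Just to make the idea more concrete, here is a minimal sketch of what I mean, assuming a Hugging Face `transformers` setup; the model name and learning rates are placeholders, not recommendations:

```python
import torch
from transformers import BertLMHeadModel, BertTokenizerFast

# BertLMHeadModel adds a language-modeling head on top of BERT; setting
# is_decoder=True makes it use a causal attention mask, i.e. a CLM objective.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertLMHeadModel.from_pretrained("bert-base-uncased", is_decoder=True)

# Option 1: freeze the pre-trained encoder entirely.
# for p in model.bert.parameters():
#     p.requires_grad = False

# Option 2: discriminative learning rates -- small for the pre-trained encoder,
# more aggressive for the LM head (values here are illustrative only).
optimizer = torch.optim.AdamW(
    [
        {"params": model.bert.parameters(), "lr": 1e-5},
        {"params": model.cls.parameters(), "lr": 5e-4},
    ]
)

# One training step on your own data (labels == input_ids for CLM;
# the model shifts the labels internally when computing the loss).
batch = tokenizer("Hello, how are you", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```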
Regarding pre-trained models for language generation / CLM, there are a few that I could find by filtering on the text-generation and bert tags: Models - Hugging Face. However, I cannot say for sure whether they were trained with masked LM or CLM, as there were no model cards with descriptions.
Regarding the evaluation metric, it is surely a challenge to define an appropriate one, or even to rely solely on the loss/perplexity. The problem with loss and perplexity is that they might mislead us when comparing models with close values, because they rely strictly on the conditional probability of a token being generated given the previous tokens. Essentially, we are trying to match the exact reference tokens, even though variations of that reference could be equally valid.
For example:
A sample in the test set, "Hello, how are you", might give a different perplexity than generated outputs like "Hello, how you doing" or "Hello, how is it going", even though they have similar meanings, semantically speaking.
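A rough sketch of why this happens: perplexity is just the exponential of the average cross-entropy of the exact reference continuation, so a paraphrase with the same meaning is scored as if it were wrong. Here GPT-2 is only a stand-in for whatever causal LM you end up with:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # labels == input_ids: the model shifts them internally and returns
        # the mean negative log-likelihood of each next token.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

for candidate in ["Hello, how are you",
                  "Hello, how you doing",
                  "Hello, how is it going"]:
    print(candidate, perplexity(candidate))
```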
I have seen some works that attempt to employ an exact-match or even a partial-match metric, trying to correlate the n-grams between a generated text and a reference (test sample), in the same way as BLEU, METEOR and ROUGE would be applied to a machine translation task. A qualitative assessment is also pretty interesting, especially if the model is going to be deployed into a real-world application or something similar. Unfortunately, we still lack ways to turn grammar, syntax and semantics into proper quantitative metrics, but that might change in the near future… at least I hope so!
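For reference, a tiny illustration of the n-gram overlap idea, using NLTK's sentence-level BLEU as a stand-in; in practice corpus-level BLEU / ROUGE / METEOR would be more appropriate, and the sentences are just the toy examples from above:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Hello , how are you".split()
generated = "Hello , how you doing".split()

# BLEU only rewards exact n-gram matches with the reference, so a valid
# paraphrase is penalised just like a wrong continuation would be.
score = sentence_bleu(
    [reference],          # one or more tokenised references
    generated,            # tokenised hypothesis
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```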