@nielsr just a follow-up if you have a moment. The TF notebook for language modeling actually mentions two different tasks: causal language modeling and masked language modeling.
For the purpose of training a classifier on top of a model I train from scratch, are the two tasks equivalent? That is, I can either train a causal language model and then train a classifier on top of it, or train a masked language model and then train the classifier on top of that. Are both approaches OK conceptually?
Thanks!