Doing classification 100% from scratch?

Hello there!

I have quite a specific corpus and I wanted to try an approach where I do everything from scratch (well, almost!). Specifically, I was thinking about:

  1. training a language model from scratch on my own corpus, including creating my own tokenizer. Hugging Face provides two Colab notebooks for that (see Google Colab for the LM)

  2. Then, simply loading the language model created in step 1 and fine-tuning it for text classification with the usual imports:

from transformers import AutoModelForSequenceClassification

# load the LM pre-trained in step 1 and add a randomly initialised classification head
# (num_labels should match the classification task)
model = AutoModelForSequenceClassification.from_pretrained("my_language_model", num_labels=2)
# fine-tune on the labelled classification data
# save the final model
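To make step 1 concrete, here is a minimal sketch of training a tokenizer from scratch with the tokenizers library; the toy corpus, vocab size, and file name are just placeholders for the real data:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# toy in-domain corpus standing in for the real one
corpus = [
    "the patient presented with acute myocardial infarction",
    "dosage was adjusted according to renal clearance",
    "acute renal failure was ruled out on admission",
]

# a BPE tokenizer trained only on this corpus
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# special tokens that BERT-style models expect; vocab_size is a placeholder
trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("acute renal infarction").tokens)
tokenizer.save("my_tokenizer.json")  # reload later with Tokenizer.from_file
```

The saved tokenizer can then be used when pre-training the language model in the Colab notebook.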

Does that make sense? Are these the right conceptual steps?

Yes, those are the right steps: first pre-training, then fine-tuning with a classification head. However, note that you need a relatively big corpus for pre-training to be effective. If you really have in-domain data that's different from the corpora that were used to pre-train models like BERT and RoBERTa, then it might be useful to do it.

Examples are BioBERT (pre-trained on biomedical language), SciBERT (pre-trained on scientific text), etc.


Thanks @nielsr, I've seen this myriad of yourmodelBERT models. I just wonder: is BERT necessarily the best architecture for training a language model from scratch? Why have people focused on this particular model (which is almost an old model by today's standards)?

BERT is the one that started it all, and it still works really well. All the other model variants (cf. DeBERTa, RoBERTa, ConvBERT, Funnel Transformer, etc.) are essentially small tweaks to the original BERT architecture. They are fancy, but research has shown that none of them actually improves upon the original BERT by a large margin.


@nielsr just a follow-up if you have a moment. The TF notebook for language modeling actually mentions two different tasks: causal language modeling and masked language modeling.

For the purpose of training a classifier on top of the model I train from scratch, are the two tasks equivalent? That is, can I train a causal language model and then train a classifier on top of it, or train a masked language model and then train the classifier? Are both approaches OK conceptually?
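In case it helps frame the question: as far as I understand, AutoModelForSequenceClassification can put a classification head on either backbone type. A sketch with tiny randomly initialised configs (no pretrained weights, sizes are arbitrary placeholders):

```python
from transformers import AutoModelForSequenceClassification, BertConfig, GPT2Config

# masked-LM-style backbone (BERT) with a classification head
bert_cfg = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64, num_labels=2,
)
mlm_classifier = AutoModelForSequenceClassification.from_config(bert_cfg)

# causal-LM-style backbone (GPT-2) with a classification head
gpt2_cfg = GPT2Config(
    vocab_size=100, n_embd=32, n_layer=1, n_head=2, num_labels=2,
)
clm_classifier = AutoModelForSequenceClassification.from_config(gpt2_cfg)

print(type(mlm_classifier).__name__, type(clm_classifier).__name__)
```

So mechanically both seem to work; my question is whether they are equally good starting points conceptually.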