Doing classification 100% from scratch?

olaffson · September 3, 2021, 12:57am

Hello there!

I have a quite specific corpus and I wanted to try an approach where I do everything from scratch (well, almost!). Specifically, I was thinking about:

1. Training a language model from scratch on my own corpus, with my own tokenizer. Hugging Face provides two Colab notebooks for that (see Google Colab for the LM).
2. Then simply loading the language model created in step 1 and fine-tuning it for text classification using the usual imports:

from transformers import AutoModelForSequenceClassification

# "my_language_model" is the path of the language model trained above
model = AutoModelForSequenceClassification.from_pretrained("my_language_model")
# fine-tune on the classification data
# save the final model

Does that make sense? Are these the right conceptual steps?
Thanks!

nielsr · September 3, 2021, 12:22pm

Yes, those are the right steps: first pre-training, then fine-tuning with a classification head. However, note that you need a relatively big corpus for pre-training to be effective. If you really have in-domain data that’s different from the corpora used to pre-train models like BERT and RoBERTa, then it might be useful to do it.

Examples are BioBERT (pre-trained on biomedical language), SciBERT (pre-trained on scientific text), etc.
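
As a rough sketch of those two steps (pre-training, then fine-tuning a head), here is how the hand-off between them might look. This assumes a BERT-style model and PyTorch; the tiny configuration sizes and `num_labels=3` are made-up values chosen only so the snippet runs quickly, and the actual pre-training loop is left out.

```python
import tempfile

from transformers import (
    AutoModelForSequenceClassification,
    BertConfig,
    BertForMaskedLM,
)

# Hypothetical tiny configuration so the sketch runs quickly on CPU.
# A real model would use a vocab_size matching your own tokenizer.
config = BertConfig(
    vocab_size=1000,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)

# Step 1: a randomly initialised masked language model (no pre-trained weights).
lm = BertForMaskedLM(config)
# ... pre-train `lm` on your own corpus here ...

with tempfile.TemporaryDirectory() as lm_dir:
    lm.save_pretrained(lm_dir)

    # Step 2: reload the encoder weights and attach a classification head.
    # The head is newly initialised, so it still has to be fine-tuned.
    clf = AutoModelForSequenceClassification.from_pretrained(lm_dir, num_labels=3)

print(type(clf).__name__)
print(clf.config.num_labels)
```

The encoder weights carry over from the saved language model, while the classification head starts from random values, which is why transformers warns about newly initialised weights at this point.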

olaffson · September 3, 2021, 12:56pm

Thanks @nielsr, I have seen the myriad of yourmodelBERT models. I just wonder: is BERT necessarily the best architecture for training a language model from scratch? Why have people focused on this particular model (which is almost old by today’s standards)?

nielsr · September 3, 2021, 2:50pm

BERT is the one that started it all, and it still works really well. All other model variants (cf. DeBERTa, RoBERTa, ConvBERT, Funnel Transformer, etc.) are just small tweaks to the original BERT architecture. They are fancy but research has shown that none of them actually improves upon the original Transformer by a large margin.

olaffson · September 17, 2021, 4:25pm

@nielsr just a follow-up if you have a moment. The TF notebook for language modeling actually mentions two different tasks: causal language modeling and masked language modeling.

For the purpose of training a classifier on top of the model I train from scratch, are the two tasks equivalent? That is, can I either train a causal language model and then train a classifier on top of it, or train a masked language model and then train the classifier? Are both approaches OK conceptually?

Thanks!
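
For reference, the two tasks differ in how inputs and labels are built from the same text. The following is a minimal pure-Python sketch of that difference, not the notebook's implementation: the helper names are hypothetical, `None` stands in for "no loss at this position", and the masking is simplified (real BERT-style masking also sometimes keeps or randomly replaces the selected token).

```python
import random

MASK = "[MASK]"
IGNORE = None  # assumption: marks positions that do not contribute to the loss


def causal_lm_example(tokens):
    """Causal LM: predict each token from the tokens to its left."""
    inputs = tokens[:-1]
    labels = tokens[1:]  # the label at position i is the next token
    return inputs, labels


def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens and predict them from both sides."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)    # the model sees the mask token
            labels.append(tok)     # and must recover the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE)  # unmasked positions are not scored
    return inputs, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(causal_lm_example(tokens))
print(masked_lm_example(tokens, mask_prob=0.5))
```

In the causal case every position has a label and the model only looks left; in the masked case only the hidden positions have labels and the model can use context on both sides.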
