BERT from scratch without self-supervised learning

Suppose one copies or recreates the bert-base architecture, meaning the model layers themselves and not the training curriculum (MLM and NSP). Next, suppose one adds a classifier head to this architecture consisting of a single linear layer that makes predictions over the set of classes in a dataset. One then randomly initializes the model's parameters and trains the model on a labeled dataset using supervised learning only.
Namely, the preprocessed data (tokenized inputs, to which the model adds positional embeddings) is passed all the way through the bert-base architecture and the linear classifier layer to produce predictions over the class set,
a loss is calculated, and the weights are updated via backpropagation and stochastic gradient descent.
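To make the setup concrete, here is a minimal sketch of what I have in mind, assuming PyTorch and the Hugging Face transformers library; the two-class sentiment labels, learning rate, and toy batch are placeholders, and the published tokenizer is reused only for its vocabulary (no pretrained weights are loaded anywhere).

```python
# Sketch: bert-base-sized architecture + linear classifier head, randomly
# initialized (no pretraining), trained with plain supervised learning.
import torch
from torch.optim import SGD
from transformers import BertConfig, BertForSequenceClassification, BertTokenizerFast

num_classes = 2  # e.g. positive/negative movie-review sentiment (placeholder)

# Instantiating from a config (rather than from_pretrained) leaves every
# parameter randomly initialized. The config defaults match bert-base:
# 12 layers, hidden size 768, 12 attention heads.
config = BertConfig(num_labels=num_classes)
model = BertForSequenceClassification(config)  # adds a single linear classifier head

# The tokenizer only supplies the vocabulary/preprocessing, not any weights.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

optimizer = SGD(model.parameters(), lr=1e-3)  # learning rate is an assumption

def train_step(texts, labels):
    """One supervised update: forward pass, cross-entropy loss, backprop, SGD step."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))  # loss = cross-entropy over num_labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# Toy usage:
loss = train_step(["great movie", "terrible plot"], [1, 0])
```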

My question is, would this be a good idea? Is there anything about this approach (compared to the usual curriculum of self-supervised pretraining followed by task-specific fine-tuning) that would prevent one from obtaining decent metrics on a test set?
As I understand it, the BERT authors had a lot of unlabelled data, but suppose one had an equivalent amount of labelled data for a particular domain (say, sentiment of movie reviews). Is there any reason why the above approach would
produce poor results?