BERT from scratch without self-supervised learning

Suppose one copies or recreates the bert-base architecture, meaning the model layers themselves and not the training curriculum (MLM and NSP). Next, suppose one adds a classifier head to this architecture consisting of a single linear layer that makes predictions over the set of classes in a dataset. One then randomly initializes the model's parameters and trains the model on a labeled dataset using supervised learning only.
Namely, the preprocessed data (tokenized inputs, to which the model adds positional embeddings) is passed all the way through the bert-base architecture and the linear classifier layer to produce predictions over the class set,
a loss is calculated, and the weights are updated via backpropagation and stochastic gradient descent.
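To make the setup concrete, here is a minimal sketch of what I have in mind, assuming PyTorch and the Hugging Face transformers library; the two-class sentiment labels, learning rate, and toy batch are placeholders, and the published tokenizer is reused only for its vocabulary (no pretrained weights are loaded anywhere).

```python
# Sketch: bert-base-sized architecture + linear classifier head, randomly
# initialized (no pretraining), trained with plain supervised learning.
import torch
from torch.optim import SGD
from transformers import BertConfig, BertForSequenceClassification, BertTokenizerFast

num_classes = 2  # e.g. positive/negative movie-review sentiment (placeholder)

# Instantiating from a config (rather than from_pretrained) leaves every
# parameter randomly initialized. The config defaults match bert-base:
# 12 layers, hidden size 768, 12 attention heads.
config = BertConfig(num_labels=num_classes)
model = BertForSequenceClassification(config)  # adds a single linear classifier head

# The tokenizer only supplies the vocabulary/preprocessing, not any weights.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

optimizer = SGD(model.parameters(), lr=1e-3)  # learning rate is an assumption

def train_step(texts, labels):
    """One supervised update: forward pass, cross-entropy loss, backprop, SGD step."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))  # loss = cross-entropy over num_labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# Toy usage:
loss = train_step(["great movie", "terrible plot"], [1, 0])
```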

My question is, would this be a good idea? Is there anything about this approach (compared to the usual curriculum of self-supervised pretraining followed by task-specific fine-tuning) that would prevent one from obtaining decent metrics on a test set?
As I understand it, the BERT authors had a lot of unlabelled data, but suppose one had an equivalent amount of labelled data for a particular domain (say, sentiment of movie reviews). Is there any reason why the above approach would
produce poor results?