What is transfer learning and why is it needed?

I am using Hugging Face Models for NLP tasks.

I see a lot of online examples like the one below, which freeze the bottom layers and train only a few top layers. The basic idea is to use transfer learning instead of training the model from scratch, which makes sense to me.

# freeze every layer except the top two
for layer in model.layers[:-2]:
    layer.trainable = False

However, I find that the accuracy of my model improves significantly when I train it from scratch. Are there any downsides to training a transformer model from scratch? What is the right approach? When should you freeze or unfreeze layers? Please advise.

Transfer learning is using someone else’s trained model as your model’s initial weights. Training from scratch is using random numbers for your model’s initial weights.
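As a rough illustration with the transformers library (a minimal sketch; bert-base-uncased is just an example checkpoint):

from transformers import AutoConfig, AutoModel

# Transfer learning: start from weights someone else already trained
pretrained = AutoModel.from_pretrained("bert-base-uncased")

# Training from scratch: same architecture, but randomly initialised weights
config = AutoConfig.from_pretrained("bert-base-uncased")
scratch = AutoModel.from_config(config)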

Training from scratch requires a lot of data and a lot of resources.

You might need to train from scratch if your data is completely different from the standard pre-training data. For example, if it is in a different language, or is something like chemical notation rather than natural language. Otherwise, it will probably be better to use transfer learning, starting from the closest kind of data you can find.

A lot of people do intermediate training. That is where you continue training the pre-trained model on your own data, but not yet on your final downstream task.

For example, you might choose to start with a pre-trained BERT, such as bert-base-uncased. Then you might do Masked Language Modelling using your text data. Finally, you might do Sequence Classification training, using your text and your labels.
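In Hugging Face terms that pipeline might look roughly like this. It is only a sketch: the dataset variables (my_tokenized_text, my_labelled_dataset), output directories, epochs and num_labels are placeholders you would replace with your own.

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stage 1: intermediate training - Masked Language Modelling on your own text
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm-intermediate", num_train_epochs=1),
    train_dataset=my_tokenized_text,      # placeholder: your unlabelled text
    data_collator=collator,
).train()
mlm_model.save_pretrained("mlm-intermediate")
tokenizer.save_pretrained("mlm-intermediate")

# Stage 2: downstream task - sequence classification with your labels
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "mlm-intermediate", num_labels=2)
Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="clf-final", num_train_epochs=3),
    train_dataset=my_labelled_dataset,    # placeholder: your text + labels
).train()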

If your text is quite similar to the BERT corpus (wikipedia plus books), then you could probably get results by unfreezing only half the BERT layers. If your text is very different, you might get better results if you unfreeze all the layers.
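With a PyTorch BERT model from transformers, unfreezing only the top half could be done by freezing the embeddings and the lower encoder layers, something like this (a sketch; bert-base has 12 encoder layers):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze the embeddings and the first 6 of the 12 encoder layers;
# the top 6 layers and the classification head stay trainable.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False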

If your text is very similar to the BERT corpus, you might not need to do intermediate training, and you might not need to unfreeze any layers. If the results from using pre-trained BERT with your downstream task are “good enough”, then stop there.

The more you unfreeze, the longer the training will take.

I don’t know whether you should freeze the same layers for your downstream task training as for your intermediate training. Maybe you could try freezing half the layers for intermediate training, but freezing all the layers for your downstream task training, so that only the new classification head is trained.
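For that fully frozen variant, freezing the whole BERT body so only the classification head trains could look like this (again just a sketch; "mlm-intermediate" is the assumed save path from the intermediate step above):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "mlm-intermediate", num_labels=2)

# Freeze the entire BERT body; only the new classification head is trained.
for param in model.bert.parameters():
    param.requires_grad = False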