Hello. I remember that one of the tricks Jeremy Howard used was to freeze the pretrained network and train only a projection layer first, then unfreeze and train the whole network. Is this done with Hugging Face Transformers as well?
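To make the freeze-then-unfreeze idea concrete, here is a minimal sketch in plain PyTorch. `TinyClassifier`, `set_trainable`, and `train` are hypothetical names made up for illustration, the model is a toy stand-in rather than a real pretrained transformer, and the learning rates and step counts are arbitrary assumptions, not recommendations.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyClassifier(nn.Module):
    """Toy stand-in (hypothetical): a 'pretrained' body plus a new head."""

    def __init__(self, vocab_size=100, hidden=16, num_labels=2):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.body = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        x = self.embeddings(input_ids).mean(dim=1)  # mean-pool over tokens
        return self.head(self.body(x))


def set_trainable(module, trainable):
    """Turn gradient tracking on or off for every parameter in `module`."""
    for p in module.parameters():
        p.requires_grad = trainable


def train(model, inputs, labels, lr, steps):
    """Run a few AdamW steps over only the currently trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        opt.step()
    return loss.item()


model = TinyClassifier()
inputs = torch.randint(0, 100, (32, 8))  # random toy batch
labels = torch.randint(0, 2, (32,))

body_before = model.body[0].weight.detach().clone()

# Phase 1: freeze everything, then re-enable only the head.
set_trainable(model, False)
set_trainable(model.head, True)
train(model, inputs, labels, lr=1e-3, steps=20)
body_frozen = torch.equal(body_before, model.body[0].weight)

# Phase 2: unfreeze the whole network and continue with a lower learning rate.
set_trainable(model, True)
train(model, inputs, labels, lr=1e-4, steps=20)
body_changed = not torch.equal(body_before, model.body[0].weight)
```

A new optimizer is built for each phase so that it only ever holds the parameters that are trainable in that phase.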
The reason I ask is that I remember seeing a discussion between (I think) Hamel Husain and Sylvain Gugger in which the latter mentioned that this doesn't work for transformers and that you reach an unrecoverable point if you train this way. In my own simple, non-exhaustive experiments, the freeze/unfreeze method does OK, but simply using a lower learning rate for everything works slightly better.
What is the current advice in 2022? If I froze only the embeddings plus the first 2-3 layers, would that do better? My concern is that training everything, including the embeddings, will only adjust what the limited training set covers. E.g., adjusting the embeddings sounds like a bad idea, since only the embeddings for vocabulary seen in the training set would be updated.
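For reference, this is roughly what I mean by freezing the embeddings plus the first few layers, sketched in plain PyTorch by matching parameter names. `TinyEncoder` and `freeze_bottom` are hypothetical names for illustration. The prefixes are an assumption modelled on BERT-style naming; for a real Hugging Face model I would expect something like `bert.embeddings.` and `bert.encoder.layer.`, but that should be checked against `model.named_parameters()` first.

```python
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in whose parameter names mimic a BERT-style layout."""

    def __init__(self, vocab_size=100, hidden=16, num_layers=6):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.Module()
        self.encoder.layer = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(num_layers)]
        )
        self.classifier = nn.Linear(hidden, 2)


def freeze_bottom(model, num_layers_to_freeze,
                  embeddings_prefix="embeddings.",
                  layer_prefix="encoder.layer."):
    """Freeze the embeddings and the first `num_layers_to_freeze` layers.

    Returns the names of the parameters that were frozen.
    """
    # The trailing dot keeps "layer.1." from also matching "layer.10.".
    layer_prefixes = tuple(
        f"{layer_prefix}{i}." for i in range(num_layers_to_freeze)
    )
    frozen = []
    for name, param in model.named_parameters():
        if name.startswith(embeddings_prefix) or name.startswith(layer_prefixes):
            param.requires_grad = False
            frozen.append(name)
    return frozen


model = TinyEncoder()
frozen = freeze_bottom(model, num_layers_to_freeze=3)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

After this, the optimizer would be built from only the parameters that still have `requires_grad=True`, so the frozen embeddings are not touched by gradient updates or weight decay.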