- Currently, the consensus seems to be that you get the best results on a downstream task by not freezing any layers at all. If you're freezing the weights to save memory, I'd suggest looking into the adapter approach instead. The basic idea is to insert small additional trainable layers between the existing frozen layers of the Transformer (see the sketch below). It should help, but there's no guarantee the results will be on par with full fine-tuning.
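
  A minimal sketch of that idea in plain PyTorch, assuming the Hugging Face `transformers` BERT implementation; the `BottleneckAdapter` and `AdaptedLayer` names, the bottleneck size, and the checkpoint are placeholders for illustration, not a specific adapter library's API:

  ```python
  import torch.nn as nn
  from transformers import BertModel


  class BottleneckAdapter(nn.Module):
      """Small trainable bottleneck with a residual connection."""
      def __init__(self, hidden_size, bottleneck_size=64):
          super().__init__()
          self.down = nn.Linear(hidden_size, bottleneck_size)
          self.up = nn.Linear(bottleneck_size, hidden_size)
          self.act = nn.GELU()

      def forward(self, hidden_states):
          return hidden_states + self.up(self.act(self.down(hidden_states)))


  class AdaptedLayer(nn.Module):
      """Wraps a frozen encoder layer and passes its output through an adapter."""
      def __init__(self, layer, adapter):
          super().__init__()
          self.layer = layer
          self.adapter = adapter

      def forward(self, hidden_states, *args, **kwargs):
          outputs = self.layer(hidden_states, *args, **kwargs)
          # BertLayer returns a tuple; element 0 holds the hidden states
          return (self.adapter(outputs[0]),) + outputs[1:]


  model = BertModel.from_pretrained("bert-base-uncased")

  # Freeze every pre-trained weight...
  for param in model.parameters():
      param.requires_grad = False

  # ...then insert a trainable adapter after each frozen encoder layer
  hidden = model.config.hidden_size
  for i, layer in enumerate(model.encoder.layer):
      model.encoder.layer[i] = AdaptedLayer(layer, BottleneckAdapter(hidden))

  # Only the adapter parameters (plus whatever task head you put on top)
  # will receive gradients during fine-tuning.
  trainable = [n for n, p in model.named_parameters() if p.requires_grad]
  print(len(trainable), "trainable parameter tensors")
  ```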
- Here I assume you mean continuing to train an existing pre-trained BERT with the MLM objective on your own data. This may help, but it depends on the kind of texts you're trying to classify. If you have reason to believe these texts are noticeably different from the texts BERT was pre-trained on, then it's likely to improve the results, although it may hurt generalization to other domains. A rough sketch of how that could look is given below.
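
  A minimal sketch of continued MLM pre-training with the `transformers` and `datasets` libraries; the `domain_texts.txt` file, output directory, and hyperparameters are placeholder assumptions:

  ```python
  from datasets import load_dataset
  from transformers import (
      BertForMaskedLM,
      BertTokenizerFast,
      DataCollatorForLanguageModeling,
      Trainer,
      TrainingArguments,
  )

  model_name = "bert-base-uncased"
  tokenizer = BertTokenizerFast.from_pretrained(model_name)
  model = BertForMaskedLM.from_pretrained(model_name)

  # One raw text per line in a plain-text file with your domain data
  dataset = load_dataset("text", data_files={"train": "domain_texts.txt"})["train"]

  def tokenize(batch):
      return tokenizer(batch["text"], truncation=True, max_length=128)

  tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

  # The collator randomly masks 15% of tokens, matching BERT's MLM objective
  collator = DataCollatorForLanguageModeling(
      tokenizer=tokenizer, mlm=True, mlm_probability=0.15
  )

  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="bert-domain-mlm", num_train_epochs=1),
      train_dataset=tokenized,
      data_collator=collator,
  )
  trainer.train()

  # Afterwards, load the adapted weights into a classification model, e.g.
  # BertForSequenceClassification.from_pretrained("<saved checkpoint dir>")
  ```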
It’s a safe bet that simply unfreezing all the weights and fine-tuning the whole model will work best, so I’d start with that if it’s an option (a minimal example below).
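
For comparison, full fine-tuning is just the standard sequence-classification setup with nothing frozen; the tiny in-memory dataset and hyperparameters here are placeholders:

```python
from datasets import Dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)  # no requires_grad = False anywhere: every layer is updated

data = Dataset.from_dict(
    {"text": ["great product", "terrible service"], "label": [1, 0]}
)
data = data.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=32
    ),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1),
    train_dataset=data,
)
trainer.train()
```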