I am doing sentiment analysis on some text reviews, but I am not getting good results.
I use BERT for feature extraction and a fully connected layer as the classifier.
I am planning the following experiments, but I do not have a general sense of what results to expect. I have two options:
1- Unfreeze some Transformer layers and let the gradients propagate through those layers.
2- Further pre-train BERT with the masked language modeling (MLM) objective on related texts, and then train the classifier on top.
Which one should take priority? Or does it just depend on the experiments?
- Currently, the consensus seems to be that, to get the best results when fine-tuning on a downstream task, you don’t freeze any layers at all. If you’re freezing the weights to save memory, then I’d suggest considering the Adapter Framework. The idea, basically, is to insert additional trainable layers between the existing frozen layers of a Transformer model. It should help, but there’s no guarantee that the results will be on par with full fine-tuning.
- Here I assume that you mean fine-tuning an existing pre-trained BERT with the MLM objective. This may help, but it depends on the kind of texts you’re trying to classify. If you have a reason to believe that these texts are noticeably different from the texts that BERT was trained on, then it’s likely to improve the results, although it may hinder the model’s generalization ability.
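To make the MLM option a bit more concrete, here is a minimal sketch of BERT-style token masking in plain Python. The function name `mask_for_mlm`, its parameters, and the toy token ids are made up for illustration; the 15% selection rate with the 80/10/10 split follows the original BERT recipe, and `-100` is the label value that PyTorch-style losses conventionally ignore. In practice a library data collator would normally do this for you.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def mask_for_mlm(token_ids, mask_id, vocab_size, special_ids=(),
                 mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT-style masking."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never select special tokens such as [CLS], [SEP] or padding.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random token (may rarely hit a special id,
            # which is acceptable for a sketch)
            inputs[i] = rng.randrange(vocab_size)
        # remaining 10%: leave the original token in place
    return inputs, labels


# Toy example; the ids 101/102/103 for [CLS]/[SEP]/[MASK] and the vocabulary
# size are assumed to match bert-base-uncased, the other ids are arbitrary.
ids = [101, 2023, 3185, 2001, 2307, 102]
inputs, labels = mask_for_mlm(ids, mask_id=103, vocab_size=30522,
                              special_ids={101, 102}, rng=random.Random(0))
print(inputs)
print(labels)
```

Positions whose label stays at `-100` do not contribute to the loss, so the model is only trained to reconstruct the selected tokens.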
It’s a safe bet to say that just unfreezing the weights will be the most advantageous, so I’d start with that, if it’s an option.
I am writing this for anyone who runs this experiment in the future.
I tested unfreezing on my dataset, but the model appears to be overfitting:
training loss = 0.0003, acc ~ 95%
validation loss = 4.3, acc ~ 40%
So I am going to try the next option and train a BERT model.