That’s because, by default, we train all parameters of the model. Hence, we compute gradients for all parameters and update all of them using gradient descent.
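As a minimal sketch of what that default looks like (assuming the transformers library and the allenai/longformer-base-4096 checkpoint, which this thread appears to be about):

```python
from transformers import LongformerForSequenceClassification

# Load a model for fine-tuning; every parameter requires gradients by default.
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

# All parameter tensors show up here, i.e. all of them will be updated during training.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors")
```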
Does param.requires_grad == True mean that particular layer is frozen? I am confused by the wording requires_grad. Does it mean frozen?
No, requires_grad=True means that a parameter will get updated if you start training. To freeze a layer, you don’t want it to have gradients, so you need to set requires_grad to False.
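For example, a rough sketch (reusing the model variable from the sketch above) of freezing the base Longformer encoder while keeping the classification head trainable:

```python
# Freeze the base encoder: these parameters no longer receive gradients.
for param in model.longformer.parameters():
    param.requires_grad = False

# Only the classification head (model.classifier) remains trainable.
for name, param in model.named_parameters():
    if param.requires_grad:
        print("still training:", name)
```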
If I want to train only some of the earlier layers, as shown here, should I use the code below?
When fine-tuning language models such as BERT, RoBERTa, LongFormer, etc., we typically update all layers. However, recent research has shown that this is actually not necessary: you can get similar results just by fine-tuning the biases (!) of the layers.
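A hedged sketch of that bias-only setup (BitFit-style), again reusing the model from above; treat it as an illustration of the idea rather than the exact recipe from the paper:

```python
# Freeze everything except bias terms and the classification head on top.
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")
```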
Considering it takes a lot of time to train, is there a specific recommendation regarding which layers I should train?
The default is to train all layers, so you don’t need to set requires_grad to False anywhere.
- Do I need to add any additional layers such as dropout, or is that already taken care of by LongformerForSequenceClassification.from_pretrained? I am not seeing any dropout layers in the above output, and that’s why I’m asking.
This model already includes dropout. It’s not shown when printing the layers because dropout doesn’t have any trainable parameters (no weights or biases). You can see that it’s already included here. As can be seen, the classifier that is placed on top of the base LongFormer model is a LongformerClassificationHead, which includes a dropout layer.
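A quick way to check this yourself (assuming the model variable from the earlier sketches):

```python
# Print the classification head that sits on top of the base Longformer;
# it should show a Dropout module between the dense and output projection layers.
print(model.classifier)
```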