How to correctly freeze some of the Wav2Vec2-Bert's layers?

Hi everyone!

I’m following this blog post to fine-tune W2V2-Bert for a low-resource language: Fine-Tune W2V2-Bert for low-resource ASR with 🤗 Transformers.

In the training phase I attempted to freeze all of the encoder layers except the last two (keeping layers 22-23 and the LM head trainable), using the following piece of code:

for name, param in model.named_parameters():
    if (name not in ['lm_head.bias', 'lm_head.weight']
            and "encoder.layers.22" not in name
            and "encoder.layers.23" not in name):
        param.requires_grad = False

However, what I’m seeing is that only the language modelling head is getting trained and none of the encoder layers get any updates, not even layers 22-23, which have requires_grad = True.
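
In case it helps, this is roughly how I’m checking which parameters are trainable and whether anything actually moves after training (a minimal sketch; model is the Wav2Vec2-Bert model loaded as in the blog post, and the single training step is just a placeholder):

import torch

# Sanity check: list which parameters are still trainable after freezing.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # expect lm_head.* and encoder.layers.22/23.* here

# Snapshot the weights, run one optimization step, then see what changed.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

# ... run one training step here (e.g. a short trainer.train()) ...

updated = [
    name
    for name, p in model.named_parameters()
    if not torch.equal(before[name], p.detach())
]
print(updated)  # in my runs only the lm_head parameters show up here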

One easy workaround I found was to pass the optimizer two groups of parameters: layers 1-21 in one group with a learning rate of 0, and layers 22-23 in the second group with a learning rate of 2e-5 (I’m using a constant scheduler). Another way was to pass only the second set of parameters to the optimizer. However, in both cases gradients are still computed with respect to all the parameters, which wastes compute.
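
For reference, the two-parameter-group workaround looks roughly like this (a sketch only; the AdamW optimizer, the constant schedule, and the Trainer hookup are my assumptions about the setup, not exactly what the blog post does):

import torch
from transformers import get_constant_schedule

# Split parameters: everything I want frozen goes in a group with lr=0.
frozen, unfrozen = [], []
for name, param in model.named_parameters():
    if (name in ["lm_head.bias", "lm_head.weight"]
            or "encoder.layers.22" in name
            or "encoder.layers.23" in name):
        unfrozen.append(param)
    else:
        frozen.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": frozen, "lr": 0.0},     # effectively frozen
        {"params": unfrozen, "lr": 2e-5},  # layers 22-23 + LM head
    ]
)
scheduler = get_constant_schedule(optimizer)

# then passed to the Trainer, e.g.
# trainer = Trainer(..., optimizers=(optimizer, scheduler))

The second variant just passes unfrozen to AdamW instead of both groups, but since requires_grad stays True on every parameter, the backward pass still computes gradients for all of them.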

The code above works for simple feed-forward neural networks in plain PyTorch, so I was wondering whether this is an issue with the Transformers library. If not, can anyone tell me how to properly freeze layers?

Thanks!
@sgugger