How to fine-tune only the last layer of ALBERT?

AlbertModel(
  (embeddings): AlbertEmbeddings(
    (word_embeddings): Embedding(30000, 128, padding_idx=0)
    (position_embeddings): Embedding(512, 128)
    (token_type_embeddings): Embedding(2, 128)
    (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): AlbertTransformer(
    (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
    (albert_layer_groups): ModuleList(
      (0): AlbertLayerGroup(
        (albert_layers): ModuleList(
          (0): AlbertLayer(
            (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (attention): AlbertAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (attention_dropout): Dropout(p=0.1, inplace=False)
              (output_dropout): Dropout(p=0.1, inplace=False)
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            )
            (ffn): Linear(in_features=768, out_features=3072, bias=True)
            (ffn_output): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (pooler): Linear(in_features=768, out_features=768, bias=True)
  (pooler_activation): Tanh()
)

As we can see, ALBERT's encoder only has a ModuleList with a single AlbertLayer, and I am not sure how to fine-tune only the last of the 12 layers. Thanks!

You can access the parameter names through model.state_dict().keys() (or model.named_parameters()) and build the optimizer only from the parameters that belong to the layers you want to train. For example, if you set
optimizer_grouped_parameters = [{'params': [p for n, p in model.named_parameters() if "pooler" in n], 'weight_decay': 0.01}]
and then initialize the optimizer with optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5), only the pooler layer will be updated during training.

If you check the weights of the pooler layer after training with model.pooler.weight, they will differ from those of the initial model, while all other layers will keep the same weights as the initial model.
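As a minimal end-to-end sketch of the idea above (assuming the albert-base-v2 checkpoint, torch.optim.AdamW, and a recent transformers version where the model returns an output object with pooler_output; the dummy loss and random inputs are just placeholders to show the mechanics):

import torch
from torch.optim import AdamW
from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")

# Keep a copy of the initial pooler weights so we can verify the update later.
initial_pooler_weight = model.pooler.weight.detach().clone()

# Only parameters whose name contains "pooler" are handed to the optimizer,
# so every other parameter is left untouched by optimizer.step().
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if "pooler" in n],
        "weight_decay": 0.01,
    }
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

# One dummy training step on random token ids, just to show the mechanics.
input_ids = torch.randint(0, model.config.vocab_size, (2, 16))
outputs = model(input_ids)
loss = outputs.pooler_output.pow(2).mean()  # placeholder loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# The pooler weights have changed; the rest of the model has not.
print(torch.allclose(initial_pooler_weight, model.pooler.weight))  # expected: False

Note that gradients are still computed for the whole model here; the other parameters stay unchanged only because they were never given to the optimizer.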

In PyTorch, you can easily set requires_grad to False for whatever parameters you don’t want to be updated.

You can do the following:

for name, param in model.named_parameters():
    # Freeze every parameter whose name does not contain the substring
    # identifying the layer(s) you want to keep trainable.
    if "..." not in name:
        param.requires_grad = False
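For instance, a small sketch that freezes everything except the pooler ("pooler" is used purely as an illustrative substring; substitute the layer you care about) and then passes only the trainable parameters to the optimizer, assuming model is the AlbertModel from above:

from torch.optim import AdamW

# Freeze every parameter except those whose name contains "pooler".
for name, param in model.named_parameters():
    param.requires_grad = "pooler" in name

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")

optimizer = AdamW(trainable, lr=1e-5)

Freezing via requires_grad also skips gradient computation for the frozen parameters, which saves memory and time compared with only excluding them from the optimizer.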

Thanks for your reply! I think my question is really how to index the last encoder layer of the ALBERT model. In BertModel, the last encoder layer can be indexed with model.encoder.layer[-1], but this ALBERT model only has a ModuleList with a single layer, so I don't know how to index the last encoder layer.

I am not entirely sure, but since ALBERT shares parameters across layers (see the AlbertConfig documentation: num_hidden_groups (int, optional, defaults to 1) – Number of groups for the hidden layers; parameters in the same group are shared), selectively updating a single layer might not be possible.
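To make the sharing concrete, a short sketch (assuming the albert-base-v2 checkpoint, which matches the module dump above): the ModuleList holds a single AlbertLayer that is reused for all 12 forward passes, so there is no separate "last layer" to unfreeze on its own.

from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")

# The encoder holds one layer group with one AlbertLayer; this single module
# is applied num_hidden_layers times during the forward pass.
shared_layer = model.encoder.albert_layer_groups[0].albert_layers[0]
print(model.config.num_hidden_layers)  # 12, all using the same shared parameters

# Unfreezing this module therefore affects every one of the 12 applications,
# not just the "last" one.
for param in shared_layer.parameters():
    param.requires_grad = True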