How to fine-tune only the last layer of ALBERT?

AlbertModel(
  (embeddings): AlbertEmbeddings(
    (word_embeddings): Embedding(30000, 128, padding_idx=0)
    (position_embeddings): Embedding(512, 128)
    (token_type_embeddings): Embedding(2, 128)
    (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): AlbertTransformer(
    (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
    (albert_layer_groups): ModuleList(
      (0): AlbertLayerGroup(
        (albert_layers): ModuleList(
          (0): AlbertLayer(
            (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (attention): AlbertAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (attention_dropout): Dropout(p=0.1, inplace=False)
              (output_dropout): Dropout(p=0.1, inplace=False)
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            )
            (ffn): Linear(in_features=768, out_features=3072, bias=True)
            (ffn_output): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (pooler): Linear(in_features=768, out_features=768, bias=True)
  (pooler_activation): Tanh()
)

As we can see, ALBERT's encoder only has a ModuleList with a single AlbertLayer, and I am not sure how to fine-tune only the last of the 12 layers. Thanks!

You can access the parameter names through model.state_dict().keys() (or model.named_parameters()) and build the optimizer only from the parameters that belong to the layers you want to train. For example, if you set
optimizer_grouped_parameters = [{'params': [p for n, p in model.named_parameters() if "pooler" in n], 'weight_decay': 0.01}]
and then initialize the optimizer with optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5), only the pooler layer will be updated during training.

If you check the weights of the pooler layer after training with model.pooler.weight, they will differ from those of the initial model, while all other layers will keep the same weights as the initial model.
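As a minimal end-to-end sketch of the idea above (assuming the albert-base-v2 checkpoint, torch.optim.AdamW, and a recent transformers version where the model returns an output object with pooler_output; the dummy loss and random inputs are just placeholders to show the mechanics):

import torch
from torch.optim import AdamW
from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")

# Keep a copy of the initial pooler weights so we can verify the update later.
initial_pooler_weight = model.pooler.weight.detach().clone()

# Only parameters whose name contains "pooler" are handed to the optimizer,
# so every other parameter is left untouched by optimizer.step().
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if "pooler" in n],
        "weight_decay": 0.01,
    }
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

# One dummy training step on random token ids, just to show the mechanics.
input_ids = torch.randint(0, model.config.vocab_size, (2, 16))
outputs = model(input_ids)
loss = outputs.pooler_output.pow(2).mean()  # placeholder loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# The pooler weights have changed; the rest of the model has not.
print(torch.allclose(initial_pooler_weight, model.pooler.weight))  # expected: False

Note that gradients are still computed for the whole model here; the other parameters stay unchanged only because they were never given to the optimizer.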

In PyTorch, you can easily set requires_grad to False for whatever parameters you don’t want to be updated.

You can do the following:

for name, param in model.named_parameters():
    # Freeze every parameter whose name does not contain the substring
    # identifying the layer(s) you want to keep trainable.
    if "..." not in name:
        param.requires_grad = False
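For instance, a small sketch that freezes everything except the pooler ("pooler" is used purely as an illustrative substring; substitute the layer you care about) and then passes only the trainable parameters to the optimizer, assuming model is the AlbertModel from above:

from torch.optim import AdamW

# Freeze every parameter except those whose name contains "pooler".
for name, param in model.named_parameters():
    param.requires_grad = "pooler" in name

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")

optimizer = AdamW(trainable, lr=1e-5)

Freezing via requires_grad also skips gradient computation for the frozen parameters, which saves memory and time compared with only excluding them from the optimizer.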

Thanks for your reply! I think my question is really how to index the last encoder layer of the ALBERT model. In BertModel, the last encoder layer can be indexed with model.encoder.layer[-1], but this ALBERT model only has a ModuleList with a single layer, so I don't know how to index the last encoder layer.

I am not entirely sure, but since ALBERT shares parameters across layers (see the AlbertConfig documentation: num_hidden_groups (int, optional, defaults to 1) – Number of groups for the hidden layers; parameters in the same group are shared), selectively updating a single layer might not be possible.
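To make the sharing concrete, a short sketch (assuming the albert-base-v2 checkpoint, which matches the module dump above): the ModuleList holds a single AlbertLayer that is reused for all 12 forward passes, so there is no separate "last layer" to unfreeze on its own.

from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")

# The encoder holds one layer group with one AlbertLayer; this single module
# is applied num_hidden_layers times during the forward pass.
shared_layer = model.encoder.albert_layer_groups[0].albert_layers[0]
print(model.config.num_hidden_layers)  # 12, all using the same shared parameters

# Unfreezing this module therefore affects every one of the 12 applications,
# not just the "last" one.
for param in shared_layer.parameters():
    param.requires_grad = True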