During the fine-tuning of a transformer it is common to define the optimizer as follows:
import torch
import pytorch_lightning as pl
from transformers import BertModel

class MyModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        # BERT encoder
        self.bert_encoder = BertModel.from_pretrained(
            hparams.architecture,
            output_attentions=hparams.output_attentions,
        )
        # loss
        self.loss = self.get_loss(self.hparams.loss, self.hparams.loss_hparams)

    def configure_optimizers(self):
        return torch.optim.Adam(
            self.bert_encoder.parameters(),
            lr=self.hparams.lr,
            betas=self.hparams.betas,
            eps=1e-08,
            weight_decay=self.hparams.weight_decay,
            amsgrad=True,
        )
which is a straightforward approach: a single optimizer updates all model parameters using the output of a single loss function.
How would the snippet above look if someone wanted to update only the parameters of the self-attention sub-layers with one optimizer Opt1 (associated with a loss L1), and the remaining parameters with a second optimizer Opt2 (associated with a second loss L2)?
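One possible answer, as a minimal sketch, assumes that MyModel is a pytorch_lightning.LightningModule wrapping HuggingFace's BertModel, so the self-attention projections can be selected by the substring ".attention.self." in their parameter names. The idea is to switch to manual optimization, split the encoder parameters into two groups, and return both optimizers from configure_optimizers. The loss_1/loss_2 helpers, their hparams fields, and the batch layout in training_step are hypothetical placeholders rather than an established API:

import torch
import pytorch_lightning as pl
from transformers import BertModel

class MyModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.save_hyperparameters(hparams)
        self.bert_encoder = BertModel.from_pretrained(
            hparams.architecture,
            output_attentions=hparams.output_attentions,
        )
        # two losses: L1 drives the self-attention parameters,
        # L2 drives everything else (names are placeholders)
        self.loss_1 = self.get_loss(hparams.loss_1, hparams.loss_1_hparams)
        self.loss_2 = self.get_loss(hparams.loss_2, hparams.loss_2_hparams)
        # two optimizers driven by two different losses are easiest
        # to handle with Lightning's manual optimization
        self.automatic_optimization = False

    def configure_optimizers(self):
        # split parameters by module name: in HuggingFace's BertModel the
        # query/key/value projections live under "...attention.self...."
        attn_params, other_params = [], []
        for name, param in self.bert_encoder.named_parameters():
            if ".attention.self." in name:
                attn_params.append(param)
            else:
                other_params.append(param)

        def make_adam(params):
            return torch.optim.Adam(
                params,
                lr=self.hparams.lr,
                betas=self.hparams.betas,
                eps=1e-08,
                weight_decay=self.hparams.weight_decay,
                amsgrad=True,
            )

        opt1 = make_adam(attn_params)   # Opt1: self-attention sub-layers only
        opt2 = make_adam(other_params)  # Opt2: all remaining parameters
        return opt1, opt2

    def training_step(self, batch, batch_idx):
        opt1, opt2 = self.optimizers()

        # L1 -> Opt1: only the self-attention parameters are stepped,
        # because they are the only ones registered with opt1
        out = self.bert_encoder(
            batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        l1 = self.loss_1(out, batch)  # hypothetical loss signature
        opt1.zero_grad()
        self.manual_backward(l1)
        opt1.step()

        # L2 -> Opt2: a second forward pass avoids retain_graph=True
        out = self.bert_encoder(
            batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        l2 = self.loss_2(out, batch)  # hypothetical loss signature
        opt2.zero_grad()
        self.manual_backward(l2)
        opt2.step()

        self.log_dict({"loss_1": l1, "loss_2": l2})

Because each optimizer only holds its own parameter group, stepping Opt1 after backpropagating L1 cannot touch anything outside the self-attention sub-layers, and the zero_grad call before each backward discards whatever gradients the other loss left behind. A looser filter such as "attention" in name would additionally capture the attention output projection and its LayerNorm, so the exact substring depends on what "self-attention sub-layers" is meant to cover.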