During the fine-tuning of a transformer it is common to define the optimizer as follows:
import torch
import pytorch_lightning as pl
from transformers import BertModel

class MyModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        # BERT encoder
        self.bert_encoder = BertModel.from_pretrained(
            hparams.architecture,
            output_attentions=hparams.output_attentions,
        )
        # loss
        self.loss = self.get_loss(self.hparams.loss, self.hparams.loss_hparams)

    def configure_optimizers(self):
        return torch.optim.Adam(
            self.bert_encoder.parameters(),
            lr=self.hparams.lr,
            betas=self.hparams.betas,
            eps=1e-08,
            weight_decay=self.hparams.weight_decay,
            amsgrad=True,
        )
which is a straightforward approach: a single optimizer updates all model parameters using the output of a single loss function.
How would the snippet above look if someone wanted to update only the parameters of the self-attention sub-layers with one optimizer Opt1 (associated with a loss L1), and the remaining parameters with a second optimizer Opt2 (associated with a second loss L2)?
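One possible answer, as a minimal sketch, assumes that MyModel is a pytorch_lightning.LightningModule wrapping HuggingFace's BertModel, so the self-attention projections can be selected by the substring ".attention.self." in their parameter names. The idea is to switch to manual optimization, split the encoder parameters into two groups, and return both optimizers from configure_optimizers. The loss_1/loss_2 helpers, their hparams fields, and the batch layout in training_step are hypothetical placeholders rather than an established API:

import torch
import pytorch_lightning as pl
from transformers import BertModel

class MyModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.save_hyperparameters(hparams)
        self.bert_encoder = BertModel.from_pretrained(
            hparams.architecture,
            output_attentions=hparams.output_attentions,
        )
        # two losses: L1 drives the self-attention parameters,
        # L2 drives everything else (names are placeholders)
        self.loss_1 = self.get_loss(hparams.loss_1, hparams.loss_1_hparams)
        self.loss_2 = self.get_loss(hparams.loss_2, hparams.loss_2_hparams)
        # two optimizers driven by two different losses are easiest
        # to handle with Lightning's manual optimization
        self.automatic_optimization = False

    def configure_optimizers(self):
        # split parameters by module name: in HuggingFace's BertModel the
        # query/key/value projections live under "...attention.self...."
        attn_params, other_params = [], []
        for name, param in self.bert_encoder.named_parameters():
            if ".attention.self." in name:
                attn_params.append(param)
            else:
                other_params.append(param)

        def make_adam(params):
            return torch.optim.Adam(
                params,
                lr=self.hparams.lr,
                betas=self.hparams.betas,
                eps=1e-08,
                weight_decay=self.hparams.weight_decay,
                amsgrad=True,
            )

        opt1 = make_adam(attn_params)   # Opt1: self-attention sub-layers only
        opt2 = make_adam(other_params)  # Opt2: all remaining parameters
        return opt1, opt2

    def training_step(self, batch, batch_idx):
        opt1, opt2 = self.optimizers()

        # L1 -> Opt1: only the self-attention parameters are stepped,
        # because they are the only ones registered with opt1
        out = self.bert_encoder(
            batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        l1 = self.loss_1(out, batch)  # hypothetical loss signature
        opt1.zero_grad()
        self.manual_backward(l1)
        opt1.step()

        # L2 -> Opt2: a second forward pass avoids retain_graph=True
        out = self.bert_encoder(
            batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        l2 = self.loss_2(out, batch)  # hypothetical loss signature
        opt2.zero_grad()
        self.manual_backward(l2)
        opt2.step()

        self.log_dict({"loss_1": l1, "loss_2": l2})

Because each optimizer only holds its own parameter group, stepping Opt1 after backpropagating L1 cannot touch anything outside the self-attention sub-layers, and the zero_grad call before each backward discards whatever gradients the other loss left behind. A looser filter such as "attention" in name would additionally capture the attention output projection and its LayerNorm, so the exact substring depends on what "self-attention sub-layers" is meant to cover.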