How can I set different learning rates for different parameters in the model? I have rewritten the optimizer and set a separate learning rate for the act_fn parameters in the model, but during training I found that it doesn’t work:
if optimizer_grouped_parameters is None:
    # Default parameter groups
    decay_parameters = Trainer.get_decay_parameter_names(None, model)
    optimizer_grouped_parameters = [
        {
            'params': [p for n, p in model.named_parameters()
                       if n in decay_parameters and p.requires_grad and 'act_fn' not in n],
            'weight_decay': args.weight_decay,
        },
        {
            'params': [p for n, p in model.named_parameters()
                       if n not in decay_parameters and p.requires_grad],
            'weight_decay': 0.0,
        },
        {
            'params': [p for n, p in model.named_parameters()
                       if n in decay_parameters and p.requires_grad and 'act_fn' in n],
            'weight_decay': 0.0,
            'lr': 0.5,
        },
    ]
    optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)

decay_parameters = Trainer.get_decay_parameter_names(None, model)
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters()
                   if n in decay_parameters and p.requires_grad and 'act_fn' not in n],
        'weight_decay': args.weight_decay,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters()
                   if n not in decay_parameters and p.requires_grad],
        'weight_decay': 0.0,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters()
                   if n in decay_parameters and p.requires_grad and 'act_fn' in n],
        'weight_decay': 0.0,
        'lr': 0.5,  # Custom learning rate for act_fn
    },
]
optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)
optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

# Debugging optimizer parameter groups
for i, param_group in enumerate(optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")
The output is:

Param group 0: lr=5e-05, weight_decay=0.1
Param group 1: lr=5e-05, weight_decay=0.0
Param group 2: lr=0.5, weight_decay=0.0
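For reference, the snippet above essentially mirrors what Trainer.create_optimizer does internally. One way to make sure the Trainer picks up these groups, including when DeepSpeed is enabled, is to build them inside a Trainer subclass that overrides create_optimizer; a rough sketch (the class name CustomLRTrainer is only illustrative):

from transformers import Trainer

class CustomLRTrainer(Trainer):
    # Sketch: build the parameter groups inside the Trainer so that
    # create_optimizer_and_scheduler() picks them up automatically.
    def create_optimizer(self):
        if self.optimizer is None:
            model = self.model
            args = self.args
            decay_parameters = self.get_decay_parameter_names(model)
            optimizer_grouped_parameters = [
                {   # decayed weights, default learning rate
                    'params': [p for n, p in model.named_parameters()
                               if n in decay_parameters and p.requires_grad and 'act_fn' not in n],
                    'weight_decay': args.weight_decay,
                    'lr': args.learning_rate,
                },
                {   # non-decayed parameters (biases, norms), default learning rate
                    'params': [p for n, p in model.named_parameters()
                               if n not in decay_parameters and p.requires_grad],
                    'weight_decay': 0.0,
                    'lr': args.learning_rate,
                },
                {   # act_fn parameters with the custom learning rate
                    'params': [p for n, p in model.named_parameters()
                               if n in decay_parameters and p.requires_grad and 'act_fn' in n],
                    'weight_decay': 0.0,
                    'lr': 0.5,
                },
            ]
            optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)
            self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
        return self.optimizer

Because create_optimizer_and_scheduler() calls create_optimizer(), the custom groups are then used without passing a prebuilt optimizer through the optimizers argument.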
However, inside transformers.Trainer, right after self.optimizer.step(), I also checked it with:

self.optimizer.step()
for i, param_group in enumerate(self.optimizer.optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', self.args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")
The output is:

Param group 0: lr=5e-05, weight_decay=0.1
This is strange: param groups 1 and 2 are simply missing. I am using DeepSpeed ZeRO-3. Does it change the parameter groups?
The issue is caused by DeepSpeed: when it is enabled, the parameter groups end up merged into a single group. I am not sure how to configure it to prevent this merging.
I have some experience with this.
The merging of parameter groups into a single group comes down to how DeepSpeed handles optimizer parameter groups. By default, DeepSpeed merges parameters into a single group to streamline operations such as gradient updates and memory management. If you want to keep separate parameter groups, for example to use different learning rates, you need to adjust DeepSpeed’s configuration.
Solutions:
1. Use zero_allow_untested_optimizer in the DeepSpeed config
DeepSpeed’s ZeRO optimizer merges parameter groups by default for memory efficiency. You can disable this behavior using the zero_allow_untested_optimizer flag in the DeepSpeed configuration file. For example:
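A minimal sketch of a ZeRO-3 config with the flag enabled, written here as a Python dict passed to TrainingArguments (the same keys can live in a ds_config.json file instead); the surrounding ZeRO-3 keys are just typical placeholders:

from transformers import TrainingArguments

# Sketch of a ZeRO-3 DeepSpeed config with zero_allow_untested_optimizer set.
# No "optimizer" section is included, so the optimizer built by the Trainer
# (with its parameter groups) is the one handed to DeepSpeed; the flag lets
# ZeRO run with an optimizer it has not explicitly validated.
ds_config = {
    "zero_allow_untested_optimizer": True,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},              # "auto" values are filled in from TrainingArguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",          # placeholder
    learning_rate=5e-5,
    weight_decay=0.1,
    deepspeed=ds_config,       # a path to an equivalent JSON file also works
)

After changing the config, check the parameter groups again after the first optimizer.step() to confirm whether the three groups survive.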