How to set different learning rates for different parameters in the model?

I have some experience with this.
The parameter groups being collapsed into a single group comes down to how DeepSpeed handles optimizer parameter groups. By default, DeepSpeed may flatten and merge parameters to streamline operations like gradient updates and memory management. If you want to keep separate parameter groups for different learning rates or other per-group settings, you need to adjust both how you build the optimizer and the DeepSpeed configuration.
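For reference, this is how per-group learning rates look in plain PyTorch before DeepSpeed enters the picture. This is a minimal sketch; the toy two-layer model and the "backbone"/"head" split are just illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy model, only so that there are two distinct sets of parameters (illustrative).
model = nn.Sequential(nn.Linear(128, 64), nn.Linear(64, 2))

# Each dict is one parameter group with its own learning rate.
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 1e-3},  # "backbone"
    {"params": model[1].parameters(), "lr": 1e-4},  # "head"
])

for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr={group['lr']}")  # group 0: lr=0.001, group 1: lr=0.0001
```

The question is how to keep this grouping intact once DeepSpeed wraps the optimizer.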

Solutions:

1. Use zero_allow_untested_optimizer in the DeepSpeed Config

DeepSpeed’s ZeRO optimizer merges parameter groups by default for memory efficiency. To keep your own grouping, set the zero_allow_untested_optimizer flag in the DeepSpeed configuration file, which allows ZeRO to run with an optimizer you supply yourself. For example:

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true
  },
  "zero_allow_untested_optimizer": true
}
```

Note that zero_allow_untested_optimizer is a top-level key in the DeepSpeed config, not part of the zero_optimization block.

This flag tells ZeRO to accept an optimizer it has not explicitly tested, which is what lets you hand DeepSpeed an optimizer built with your own parameter groups instead of the one it would otherwise construct internally.


2. Define Custom Parameter Groups

When initializing the optimizer in your code, explicitly define the parameter groups before passing the optimizer to DeepSpeed. For example:

```python
import torch
import deepspeed

# Split the model's parameters into two groups. The "special" substring used
# here is only an example of how you might tell the groups apart.
base_params = [p for n, p in model.named_parameters() if "special" not in n]
special_params = [p for n, p in model.named_parameters() if "special" in n]

optimizer_grouped_parameters = [
    {"params": base_params, "lr": 1e-3},
    {"params": special_params, "lr": 1e-4},
]

optimizer = torch.optim.AdamW(optimizer_grouped_parameters)

# Hand the pre-built optimizer to DeepSpeed instead of letting it create one.
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=deepspeed_config
)
```

Ensure zero_allow_untested_optimizer is enabled if you’re using ZeRO optimization, and avoid also defining an optimizer in the DeepSpeed config when you pass one in code, since the two can conflict.
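For completeness, here is a minimal training-step sketch using the engine returned by deepspeed.initialize above. The train_loader and loss_fn names are placeholders I am assuming, not part of the original setup; the backward/step calls are DeepSpeed’s standard engine API:

```python
# Sketch only: `train_loader` and `loss_fn` are hypothetical placeholders.
# After deepspeed.initialize, `model` is a DeepSpeed engine; its backward()
# and step() calls apply each parameter group's own learning rate.
for inputs, labels in train_loader:
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    model.backward(loss)   # engine-managed backward (handles loss scaling, etc.)
    model.step()           # optimizer step, plus lr scheduler step if one was passed
```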
