How to set different learning rates for different parameters in the model?

I have some experience with this.
The parameter groups being collapsed into a single group comes down to how DeepSpeed handles optimizer parameter groups. By default, DeepSpeed may flatten and merge parameters to streamline operations like gradient updates and memory management. If you want to keep separate parameter groups for different learning rates or other per-group settings, you need to adjust both how you build the optimizer and the DeepSpeed configuration.
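For reference, this is how per-group learning rates look in plain PyTorch before DeepSpeed enters the picture. This is a minimal sketch; the toy two-layer model and the "backbone"/"head" split are just illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy model, only so that there are two distinct sets of parameters (illustrative).
model = nn.Sequential(nn.Linear(128, 64), nn.Linear(64, 2))

# Each dict is one parameter group with its own learning rate.
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 1e-3},  # "backbone"
    {"params": model[1].parameters(), "lr": 1e-4},  # "head"
])

for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr={group['lr']}")  # group 0: lr=0.001, group 1: lr=0.0001
```

The question is how to keep this grouping intact once DeepSpeed wraps the optimizer.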

Solutions:

1. Use zero_allow_untested_optimizer in the DeepSpeed Config

DeepSpeed’s ZeRO optimizer merges parameter groups by default for memory efficiency. To keep your own grouping, set the zero_allow_untested_optimizer flag in the DeepSpeed configuration file, which allows ZeRO to run with an optimizer you supply yourself. For example:

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true
  },
  "zero_allow_untested_optimizer": true
}
```

Note that zero_allow_untested_optimizer is a top-level key in the DeepSpeed config, not part of the zero_optimization block.

This flag tells ZeRO to accept an optimizer it has not explicitly tested, which is what lets you hand DeepSpeed an optimizer built with your own parameter groups instead of the one it would otherwise construct internally.


2. Define Custom Parameter Groups

When initializing the optimizer in your code, explicitly define the parameter groups before passing the optimizer to DeepSpeed. For example:

```python
import torch
import deepspeed

# Split the model's parameters into two groups. The "special" substring used
# here is only an example of how you might tell the groups apart.
base_params = [p for n, p in model.named_parameters() if "special" not in n]
special_params = [p for n, p in model.named_parameters() if "special" in n]

optimizer_grouped_parameters = [
    {"params": base_params, "lr": 1e-3},
    {"params": special_params, "lr": 1e-4},
]

optimizer = torch.optim.AdamW(optimizer_grouped_parameters)

# Hand the pre-built optimizer to DeepSpeed instead of letting it create one.
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=deepspeed_config
)
```

Ensure zero_allow_untested_optimizer is enabled if you’re using ZeRO optimization, and avoid also defining an optimizer in the DeepSpeed config when you pass one in code, since the two can conflict.
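For completeness, here is a minimal training-step sketch using the engine returned by deepspeed.initialize above. The train_loader and loss_fn names are placeholders I am assuming, not part of the original setup; the backward/step calls are DeepSpeed’s standard engine API:

```python
# Sketch only: `train_loader` and `loss_fn` are hypothetical placeholders.
# After deepspeed.initialize, `model` is a DeepSpeed engine; its backward()
# and step() calls apply each parameter group's own learning rate.
for inputs, labels in train_loader:
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    model.backward(loss)   # engine-managed backward (handles loss scaling, etc.)
    model.step()           # optimizer step, plus lr scheduler step if one was passed
```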
