I have some experience with this.
The issue comes from how DeepSpeed handles optimizer parameter groups. By default, DeepSpeed merges parameters into a single group to streamline operations such as gradient updates and memory management. If you want to keep separate parameter groups, for example to use different learning rates, you need to adjust DeepSpeed’s configuration.
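If you want to confirm that this merging is what you are seeing, you can compare the number of parameter groups before and after handing the optimizer to DeepSpeed. This is only a minimal sketch: the model, the config values, and the fp16 section are placeholders, it assumes a GPU, and it is meant to be run with the deepspeed launcher.

```python
import torch
import deepspeed

# Toy two-layer model so there are two distinct sets of parameters.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 4))

param_groups = [
    {"params": model[0].parameters(), "lr": 1e-3},
    {"params": model[1].parameters(), "lr": 1e-4},
]
optimizer = torch.optim.AdamW(param_groups)
print("param groups before DeepSpeed:", len(optimizer.param_groups))  # 2

# Placeholder config; adapt it to your real setup.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "zero_allow_untested_optimizer": True,
}
engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config_params=ds_config
)
# How the wrapped optimizer exposes its groups can vary by DeepSpeed version.
print("param groups after DeepSpeed:", len(ds_optimizer.param_groups))
```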
Solutions:
1. Use zero_allow_untested_optimizer in the DeepSpeed Config
DeepSpeed’s ZeRO optimizer merges parameter groups by default for memory efficiency. You can disable this behavior with the top-level zero_allow_untested_optimizer flag in the DeepSpeed configuration file. For example:
```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true
  },
  "zero_allow_untested_optimizer": true
}
```
This flag prevents DeepSpeed from enforcing its internal parameter group merging.
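If you keep these settings in a file, you can load them into a Python dict and reuse it as the deepspeed_config passed to deepspeed.initialize in the next section. A minimal sketch, where ds_config.json is a hypothetical filename:

```python
import json

# Hypothetical path to a file containing the JSON configuration shown above.
with open("ds_config.json") as f:
    deepspeed_config = json.load(f)
```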
2. Define Custom Parameter Groups
When initializing the optimizer in your code, explicitly define parameter groups before passing them to DeepSpeed. For example:
```python
import torch
import deepspeed

# "base_parameters" and "special_parameters" are illustrative names;
# substitute the actual parameter collections from your model.
optimizer_grouped_parameters = [
    {"params": model.base_parameters, "lr": 1e-3},
    {"params": model.special_parameters, "lr": 1e-4},
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters)

# Pass the pre-built optimizer so DeepSpeed wraps it instead of creating its own.
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=deepspeed_config
)
```
Ensure zero_allow_untested_optimizer is enabled if you’re using ZeRO optimization.
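After initialization, you can sanity-check that the groups and their learning rates survived. This is a small follow-up to the snippet above and assumes the wrapped optimizer still exposes standard param_groups, which can vary across DeepSpeed versions:

```python
# Continuing from the code above: confirm the per-group settings are intact.
for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr={group['lr']}, tensors={len(group['params'])}")
```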