Learning rate with DeepSpeed is fixed despite lr set to "auto"

I'm trying to fine-tune a Llama model with an adaptive learning rate, but the learning rate is being reported as a fixed 5e-05 at every single step. Given that the DeepSpeed config has "lr": "auto", why isn't the learning rate changing? The eval loss improves at every evaluation step, but very slowly.

I’m running my code like this:
deepspeed train_script.py

Relevant parts of code:

training_arguments = transformers.TrainingArguments(
    num_train_epochs=NUM_EPOCHS,
    logging_strategy='steps',
    deepspeed="ds_config_zero3_offload_param_offload_optimizer.json",  # args.deepspeed_config
    ...

trainer = transformers.Trainer(
    ...

Contents of ds_config_zero3_offload_param_offload_optimizer.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 20,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}


Setting "lr" to "auto" in the DeepSpeed config just means DeepSpeed will take the learning rate from your Hugging Face training arguments; it does not mean the learning rate is automatically adjusted during training.

The Hugging Face default learning_rate is 5e-5, which is why you're seeing that value. And the LR stays constant because warmup_min_lr and warmup_max_lr in your DeepSpeed scheduler config are both defaulting to that same 5e-5, so the "warmup" goes from 5e-5 to 5e-5.
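To see why that yields a flat line: WarmupLR interpolates from warmup_min_lr up to warmup_max_lr and then holds the max, so when both endpoints are the same value the schedule is constant. Here is my own minimal pure-Python approximation of that rule (using linear warmup for simplicity; DeepSpeed's actual implementation also supports log-scale warmup):

```python
def warmup_lr(step, warmup_min_lr, warmup_max_lr, warmup_num_steps):
    """Rough sketch of a WarmupLR-style schedule: interpolate min -> max, then hold max."""
    if step >= warmup_num_steps:
        return warmup_max_lr
    frac = step / warmup_num_steps
    return warmup_min_lr + (warmup_max_lr - warmup_min_lr) * frac

# With min == max == 5e-5 (the HF default), every step returns the same value:
print([warmup_lr(s, 5e-5, 5e-5, 100) for s in (0, 50, 1000)])
```

With warmup_min_lr set to 0 instead, the same function would actually ramp up over the first warmup_num_steps steps.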

To increase the learning rate, pass a larger learning_rate to your Hugging Face training arguments. If you want the schedule to be something more exotic, you can set it to one of the options here.


If you want to keep using WarmupLR, setting warmup_min_lr to 0 will increase the learning rate from 0 up to the learning rate specified in your training arguments; after that, the learning rate remains constant. If you want the learning rate to decay after reaching the peak, you can use WarmupDecayLR like below:

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "warmup_type": "linear",
            "total_num_steps": "auto"
        }
    }
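For intuition, here is my own rough pure-Python sketch of the shape this config produces: linear warmup from warmup_min_lr to the peak over warmup_num_steps, then decay toward 0 by total_num_steps (an approximation, not DeepSpeed's exact implementation; the peak value of 2e-4 below is just a hypothetical learning_rate from your training arguments):

```python
def warmup_decay_lr(step, warmup_min_lr, warmup_max_lr,
                    warmup_num_steps, total_num_steps):
    """Approximate shape of a WarmupDecayLR-style schedule."""
    if step < warmup_num_steps:
        # Linear warmup from warmup_min_lr up to warmup_max_lr.
        frac = step / warmup_num_steps
        return warmup_min_lr + (warmup_max_lr - warmup_min_lr) * frac
    # Linear decay from the peak toward 0 at total_num_steps.
    remaining = max(0.0, total_num_steps - step) / (total_num_steps - warmup_num_steps)
    return warmup_max_lr * remaining

peak = 2e-4  # hypothetical peak LR
print(warmup_decay_lr(0, 0.0, peak, 100, 1000))     # start of warmup
print(warmup_decay_lr(100, 0.0, peak, 100, 1000))   # peak, at end of warmup
print(warmup_decay_lr(1000, 0.0, peak, 100, 1000))  # fully decayed
```

Unlike the plain WarmupLR config, the reported learning rate here keeps changing after warmup instead of sitting at a constant value.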