Gradient overflow issue when using DeepSpeed

Hi. I’m trying to fine-tune mistralai/Mistral-Small-24B-Base-2501 with DeepSpeed and consistently get a gradient overflow error. With bf16 or fp32 I don’t see the overflow, but the training loss is NaN. With fp16 the training loss looks correct, but it throws the overflow error. How can I fix this? The same setup works fine with smaller models. I’m using lr=1e-7.

My ds_config.json:

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false
}

Using deepspeed 0.17.2 and transformers 4.42.4.


If the GPU supports bfloat16, it’s probably better to use bfloat16. Regarding NaN issues, SDPA seems to be the culprit in many cases. Try attn_implementation="eager".
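
A minimal sketch of what that could look like when loading the model (assuming a standard from_pretrained call in your training script; adapt the names to your own code):

import torch
from transformers import AutoModelForCausalLM

# Load the weights in bf16 and use the eager attention implementation
# instead of SDPA, which has been reported to produce NaNs in some setups.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Base-2501",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)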


@John6666 Loading the model in bfloat16 and setting bf16 to true in the DeepSpeed config seems to solve the issue for now!
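
For reference, a sketch of the matching setup on the DeepSpeed side: the fp16 block from the config above replaced by a bf16 block (bf16 needs no loss-scaling options), with the config passed to the Trainer as a Python dict. The TrainingArguments values below are illustrative; keep your own hyperparameters.

from transformers import TrainingArguments

# Same ZeRO-2 config as above, but with bf16 enabled instead of fp16.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {"stage": 2},
    "zero_allow_untested_optimizer": True,
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": False,
}

# bf16=True keeps the Trainer's precision consistent with the DeepSpeed config;
# `deepspeed` also accepts a path to a JSON file instead of a dict.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-7,
    bf16=True,
    deepspeed=ds_config,
)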

