Gradient overflow issue when using DeepSpeed

Hi. I’m trying to fine-tune mistralai/Mistral-Small-24B-Base-2501 with DeepSpeed and consistently get a gradient overflow error. With bf16 or fp32 I don’t see the overflow, but the training loss is NaN. With fp16 the training loss looks correct, but it throws the overflow error. How can I fix this? The same setup works fine with smaller models. I’m using lr=1e-7.

My ds_config.json:

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false
}

Using deepspeed 0.17.2 and transformers 4.42.4.


If the GPU supports bfloat16, it’s probably better to use bfloat16. Regarding NaN issues, SDPA seems to be the culprit in many cases. Try attn_implementation="eager".
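
A minimal sketch of what that could look like when loading the model (assuming a standard from_pretrained call in your training script; adapt the names to your own code):

import torch
from transformers import AutoModelForCausalLM

# Load the weights in bf16 and use the eager attention implementation
# instead of SDPA, which has been reported to produce NaNs in some setups.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Base-2501",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)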


@John6666 Loading the model in bfloat16 and setting bf16 to true in the DeepSpeed config seems to solve the issue for now!
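
For reference, a sketch of the matching setup on the DeepSpeed side: the fp16 block from the config above replaced by a bf16 block (bf16 needs no loss-scaling options), with the config passed to the Trainer as a Python dict. The TrainingArguments values below are illustrative; keep your own hyperparameters.

from transformers import TrainingArguments

# Same ZeRO-2 config as above, but with bf16 enabled instead of fp16.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {"stage": 2},
    "zero_allow_untested_optimizer": True,
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": False,
}

# bf16=True keeps the Trainer's precision consistent with the DeepSpeed config;
# `deepspeed` also accepts a path to a JSON file instead of a dict.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-7,
    bf16=True,
    deepspeed=ds_config,
)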

