Saving bf16 Model Weights When Using Accelerate+DeepSpeed

Hello! What’s the correct way to save a model that’s being trained with ZeRO-3 and bf16 mixed precision via the Accelerate integration of DeepSpeed?

The closest thing I can find is the stage3_gather_16bit_weights_on_model_save flag in the config, but from my understanding, setting this to true would save the weights as fp16, while leaving it as false would require me to run zero_to_fp32.py, which would save the model weights as fp32.

How do I preserve the bf16 weights?


Hello, same question here!


Any update on this?


Hmm… here’s what Hugging Chat suggests:


To save a model trained using ZeRO-3 and BF16 mixed precision via the Accelerate implementation of DeepSpeed, you need to ensure that the model weights are saved in their partitioned form while preserving the BF16 precision. Here is how you can achieve this:


Key Points:

  1. ZeRO-3 and BF16 Configuration:

    • When using ZeRO Stage-3 with BF16 mixed precision, DeepSpeed partitions the model weights across multiple GPUs. By default, the weights are stored as BF16 during training [1].
  2. Saving the Model:

    • To save the model in BF16 without converting it to FP32, you should not set stage3_gather_16bit_weights_on_model_save to True. Instead, keep it as False or omit it entirely [1].
    • If stage3_gather_16bit_weights_on_model_save is set to True, DeepSpeed will gather and save the weights in FP16 format, which is not what you want for preserving BF16 weights [1][2].
  3. Model Saving Process:

    • During training with ZeRO-3 and BF16, the model weights are already stored in BF16 format.
    • When saving the model, if stage3_gather_16bit_weights_on_model_save is False (default), the model will be saved in its partitioned BF16 form [1].
  4. Loading the Saved Model:

    • If you have saved the model with stage3_gather_16bit_weights_on_model_save=False, the partitioned checkpoint can be restored directly (e.g. when resuming training) without needing additional conversion scripts like zero_to_fp32.py [1]; see the sketch after this list.

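For points 3 and 4, here is a minimal sketch of saving and resuming the partitioned checkpoint through Accelerate (an illustration only: the toy model, optimizer, and checkpoint path are placeholders, and it assumes accelerate launch was pointed at the DeepSpeed config shown below):

# Sketch: sharded bf16 save/resume with Accelerate + ZeRO-3
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # DeepSpeed/bf16 settings come from the launch config

model = torch.nn.Linear(512, 512)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# ... training steps ...

# Writes the sharded ZeRO-3 checkpoint (bf16 module weights plus optimizer
# shards) without gathering anything onto a single rank.
accelerator.save_state("checkpoints/step_1000")

# To resume later: rebuild the model/optimizer, call prepare() again,
# then restore the partitioned state in place.
accelerator.load_state("checkpoints/step_1000")
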
Example Configuration:

Here’s an example of how to configure the DeepSpeed settings for your use case:

# DeepSpeed config file (.json)
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": false,
        "offload_optimizer": {"device": "none"},
        "offload_param": {"device": "none"}
    }
}

This configuration ensures that:

  • The model is trained using BF16 mixed precision.
  • The weights remain in BF16 format during training.
  • The weights are saved in their partitioned BF16 form when saving the model.
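
If you construct the Accelerator in code rather than through accelerate config, the JSON above can be handed over via a DeepSpeedPlugin; a minimal sketch (the file name ds_config.json is a placeholder):

# Sketch: pointing Accelerate at the DeepSpeed JSON above
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO-3 and bf16 settings are read from the config file; mixed_precision
# is passed explicitly so it matches the "bf16": {"enabled": true} entry.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)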

Final Answer:

To save a model trained with ZeRO-3 and BF16 mixed precision while preserving the BF16 weights:

  1. Ensure the following configuration in your DeepSpeed settings:

    • Keep stage3_gather_16bit_weights_on_model_save as False. This prevents DeepSpeed from converting the weights to FP16 during saving [1].
    • Set bf16 to {"enabled": true} to maintain BF16 precision throughout the training and saving process [1].
  2. Save the model normally:

    • Use the standard model saving methods provided by Hugging Face or PyTorch, as the weights will already be in BF16 format when saved (see the sketch below).

By following these steps, you will successfully save the model weights in their native BF16 format for future use without unnecessary conversions [1][2].
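
As one concrete reading of step 2, here is a hedged sketch assuming the run goes through the transformers Trainer (the model name, paths, and arguments are placeholders; pass the config via TrainingArguments or launch with accelerate launch):

# Sketch: saving from a Trainer run under ZeRO-3 / bf16
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
args = TrainingArguments(
    output_dir="outputs",
    bf16=True,
    deepspeed="ds_config.json",  # the JSON shown earlier
)
trainer = Trainer(model=model, args=args)

# ... trainer.train() with a real dataset ...

# Under DeepSpeed, save_model() defers to the engine: with the gather flag
# set to true it writes consolidated 16-bit weights; with it left false it
# writes the sharded ZeRO checkpoint plus a zero_to_fp32.py helper.
trainer.save_model("outputs/final")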


Works quite well indeed. Thanks a lot for the clear explanations! 🙂

On my end, I had to use trainer.save_model(output_directory) to save the model checkpoint correctly without errors. Then I could run zero_to_fp32.py to get pytorch_model.bin, and finally I could load the model with .from_pretrained(...) and convert it to safetensors to recover the same BF16 output model as the other configurations (e.g. ZeRO stage 1 or 2). Thanks so much again!
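
In case it helps anyone else, here is roughly what that workflow looks like end to end (a hedged sketch; output_directory and final_model_bf16 are placeholder paths):

# Sketch: recovering a bf16 model from the ZeRO-3 checkpoint
#
# 1) Consolidate the shards into a single fp32 state dict (run in a shell,
#    from inside the directory written by trainer.save_model):
#
#        python zero_to_fp32.py . pytorch_model.bin
#
# 2) Reload the consolidated checkpoint in bf16 and re-save as safetensors:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("output_directory", torch_dtype=torch.bfloat16)
model.save_pretrained("final_model_bf16", safe_serialization=True)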
