Saving bf16 Model Weights When Using Accelerate+DeepSpeed

Hello! What’s the correct way to save a model that’s being trained with ZeRO-3 and bf16 mixed precision via the Accelerate integration of DeepSpeed?

The closest thing I can find is the stage3_gather_16bit_weights_on_model_save flag in the config, but from my understanding, setting this to true would save the weights as fp16, while leaving it as false would require me to run zero_to_fp32.py, which would save the model weights as fp32.

How do I preserve the bf16 weights?


Hello, same question here!


Any update on this?


Hmm… here’s what Hugging Chat suggests:


To save a model trained using ZeRO-3 and BF16 mixed precision via the Accelerate implementation of DeepSpeed, you need to ensure that the model weights are saved in their partitioned form while preserving the BF16 precision. Here is how you can achieve this:


Key Points:

  1. ZeRO-3 and BF16 Configuration:

    • When using ZeRO Stage-3 with BF16 mixed precision, DeepSpeed partitions the model weights across multiple GPUs. By default, the weights are stored as BF16 during training [1].
  2. Saving the Model:

    • To save the model in BF16 without converting it to FP32, you should not set stage3_gather_16bit_weights_on_model_save to True. Instead, keep it as False or omit it entirely [1].
    • If stage3_gather_16bit_weights_on_model_save is set to True, DeepSpeed will gather and save the weights in FP16 format, which is not what you want for preserving BF16 weights [1][2].
  3. Model Saving Process:

    • During training with ZeRO-3 and BF16, the model weights are already stored in BF16 format.
    • When saving the model, if stage3_gather_16bit_weights_on_model_save is False (default), the model will be saved in its partitioned BF16 form [1].
  4. Loading the Saved Model:

    • If you have saved the model with stage3_gather_16bit_weights_on_model_save=False, the partitioned checkpoint can be restored directly (e.g. when resuming training) without needing additional conversion scripts like zero_to_fp32.py [1]; see the sketch after this list.

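For points 3 and 4, here is a minimal sketch of saving and resuming the partitioned checkpoint through Accelerate (an illustration only: the toy model, optimizer, and checkpoint path are placeholders, and it assumes accelerate launch was pointed at the DeepSpeed config shown below):

# Sketch: sharded bf16 save/resume with Accelerate + ZeRO-3
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # DeepSpeed/bf16 settings come from the launch config

model = torch.nn.Linear(512, 512)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# ... training steps ...

# Writes the sharded ZeRO-3 checkpoint (bf16 module weights plus optimizer
# shards) without gathering anything onto a single rank.
accelerator.save_state("checkpoints/step_1000")

# To resume later: rebuild the model/optimizer, call prepare() again,
# then restore the partitioned state in place.
accelerator.load_state("checkpoints/step_1000")
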
Example Configuration:

Here’s an example of how to configure the DeepSpeed settings for your use case:

# DeepSpeed config file (.json)
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": false,
        "offload_optimizer": {"device": "none"},
        "offload_param": {"device": "none"}
    }
}

This configuration ensures that:

  • The model is trained using BF16 mixed precision.
  • The weights remain in BF16 format during training.
  • The weights are saved in their partitioned BF16 form when saving the model.
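
If you construct the Accelerator in code rather than through accelerate config, the JSON above can be handed over via a DeepSpeedPlugin; a minimal sketch (the file name ds_config.json is a placeholder):

# Sketch: pointing Accelerate at the DeepSpeed JSON above
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO-3 and bf16 settings are read from the config file; mixed_precision
# is passed explicitly so it matches the "bf16": {"enabled": true} entry.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)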

Final Answer:

To save a model trained with ZeRO-3 and BF16 mixed precision while preserving the BF16 weights:

  1. Ensure the following configuration in your DeepSpeed settings:

    • Keep stage3_gather_16bit_weights_on_model_save as False. This prevents DeepSpeed from converting the weights to FP16 during saving [1].
    • Set bf16 to {"enabled": true} to maintain BF16 precision throughout the training and saving process [1].
  2. Save the model normally:

    • Use the standard model saving methods provided by Hugging Face or PyTorch, as the weights will already be in BF16 format when saved (see the sketch below).

By following these steps, you will successfully save the model weights in their native BF16 format for future use without unnecessary conversions [1][2].
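
As one concrete reading of step 2, here is a hedged sketch assuming the run goes through the transformers Trainer (the model name, paths, and arguments are placeholders; pass the config via TrainingArguments or launch with accelerate launch):

# Sketch: saving from a Trainer run under ZeRO-3 / bf16
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
args = TrainingArguments(
    output_dir="outputs",
    bf16=True,
    deepspeed="ds_config.json",  # the JSON shown earlier
)
trainer = Trainer(model=model, args=args)

# ... trainer.train() with a real dataset ...

# Under DeepSpeed, save_model() defers to the engine: with the gather flag
# set to true it writes consolidated 16-bit weights; with it left false it
# writes the sharded ZeRO checkpoint plus a zero_to_fp32.py helper.
trainer.save_model("outputs/final")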


Works quite well indeed. Thanks a lot for the clear explanations! 🙂

On my end, I had to use trainer.save_model(output_directory) to save the model checkpoint correctly without errors. Then I could run zero_to_fp32.py to get pytorch_model.bin, and finally I could load the model with .from_pretrained(...) and convert it to safetensors to recover the same BF16 output model as the other configurations (e.g. ZeRO stage 1 or 2). Thanks so much again!
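
In case it helps anyone else, here is roughly what that workflow looks like end to end (a hedged sketch; output_directory and final_model_bf16 are placeholder paths):

# Sketch: recovering a bf16 model from the ZeRO-3 checkpoint
#
# 1) Consolidate the shards into a single fp32 state dict (run in a shell,
#    from inside the directory written by trainer.save_model):
#
#        python zero_to_fp32.py . pytorch_model.bin
#
# 2) Reload the consolidated checkpoint in bf16 and re-save as safetensors:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("output_directory", torch_dtype=torch.bfloat16)
model.save_pretrained("final_model_bf16", safe_serialization=True)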
