Hi Accelerate team — first off, thank you for the phenomenal work on the library and its FSDP integration.
I’m running large-scale multi-node jobs and have hit what looks like a systematic resume-instability whenever I save checkpoints with fsdp_state_dict_type: SHARDED_STATE_DICT.
In short:
Training is perfectly stable until I interrupt it and resume.
The very first step after --resume_from_checkpoint shows a huge loss spike (see first plot at step ≈ 10 k).
If I change only the state-dict format to FULL_STATE_DICT, the same run resumes smoothly with no spike (second plot) — but the wall-clock time and I/O overhead of saving full checkpoints are prohibitive at this scale.
Because the model, data, and training hyper-parameters are identical between the two runs, the culprit seems to be how the sharded checkpoint restores the optimizer state and/or parameter dtypes (I train in bf16, yet the saved shards are fp32).
Below I’ve included system details, a minimal reproducible script, and the exact configs that trigger the problem. I’d be grateful for any insights or work-arounds that would let me keep the sharded format while resuming safely.
Expected behavior: Resumed training should continue smoothly at the previous loss level, identical to a FULL_STATE_DICT checkpoint, but without the high I/O overhead.
This plot shows the loss when saving with FULL_STATE_DICT and resuming via resume_from_checkpoint: the curve continues smoothly across the resume point, but saving full checkpoints takes far too long at this scale.
Questions
Is this a known limitation or bug with FSDP SHARDED_STATE_DICT in accelerate?
Could the optimizer state be dropped or mismatched while saving/loading shards?
Is there a recommended workaround (e.g., enabling fsdp_cpu_ram_efficient_loading or changing another flag) that lets me keep sharded checkpoints yet resume stably?
Why are the weights serialized in fp32 even though mixed_precision=bf16 was used during training?
Any pointers would be greatly appreciated—thanks for your time and for all the work on Accelerate!
Best practice: keep sharded checkpoints and restore the optimizer state via FSDP’s APIs, preferably using Distributed Checkpoint (DCP) for load-time resharding. Avoid FULL_STATE_DICT except for export.
Checklist
Save under StateDictType.SHARDED_STATE_DICT and capture the optimizer with FSDP:
# refs:
# https://pytorch.org/docs/stable/distributed.checkpoint.html
# https://pytorch.org/docs/stable/fsdp.html
import torch.distributed.checkpoint as dcp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

# Produce sharded state dicts for both the model and the optimizer.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    to_save = {
        "model": model.state_dict(),                       # sharded model weights
        "optim": FSDP.optim_state_dict(model, optimizer),  # optimizer state mapped for FSDP
    }
dcp.save(to_save, checkpoint_id=ckpt_dir)  # DCP writes one file per rank into ckpt_dir
DCP handles parallel I/O and load-time resharding. (docs.pytorch.org)
Load with DCP and remap the optimizer state for the current shards before calling optimizer.load_state_dict:
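A minimal sketch of the load side, assuming the same model, optimizer, and ckpt_dir as in the save snippet above (the optimizer entry was saved under the key "optim" there; the argument order of optim_state_dict_to_load follows current PyTorch docs and changed around 2.1):
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

reader = dcp.FileSystemReader(ckpt_dir)
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    # Model: build a template with the current sharding, let DCP fill it, then load it.
    state = {"model": model.state_dict()}
    dcp.load(state, storage_reader=reader)
    model.load_state_dict(state["model"])
    # Optimizer: read the saved "optim" entry resharded for the current layout...
    optim_state = load_sharded_optimizer_state_dict(
        model_state_dict=state["model"], optimizer_key="optim", storage_reader=reader
    )
    # ...then remap it onto the current FSDP parameters before loading.
    flattened_osd = FSDP.optim_state_dict_to_load(model, optimizer, optim_state["optim"])
    optimizer.load_state_dict(flattened_osd)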
These FSDP APIs exist exactly to prevent optimizer-state mismatches on resume. (docs.pytorch.org)
Accelerate path: keep both the model and optimizer state dicts sharded via the FSDP plugin (state_dict_type=SHARDED_STATE_DICT); accelerator.save_state / load_state then route the optimizer through FSDP.optim_state_dict / optim_state_dict_to_load for you. Note that PyTorch requires the optimizer state-dict config type to match the state-dict type, so SHARDED_STATE_DICT pairs with ShardedOptimStateDictConfig, not FullOptimStateDictConfig. (Hugging Face)
Keep invariants identical across save→resume: use_orig_params, auto-wrap policy, and module wrapping. Changing them changes parameter identities and breaks mapping. (Hugging Face)
Elasticity: if world size or topology may change, rely on DCP for load-time resharding. Don’t use plain sharded loads across different topologies. (docs.pytorch.org)
Verify after load: spot-check a few param groups for state['step'] and moment tensor shapes; mismatches indicate a bad mapping. FSDP docs define the mapping API expectations. (docs.pytorch.org)
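A rough spot check, sketched for an Adam-style optimizer with use_orig_params=True after optimizer.load_state_dict has run (the "exp_avg" key is optimizer-specific and illustrative):
# Inspect the first param group and a few of its parameters (illustration only).
for group in optimizer.param_groups[:1]:
    for p in group["params"][:3]:
        state = optimizer.state.get(p, {})
        assert "step" in state, "optimizer step counter missing after resume"
        assert state["exp_avg"].shape == p.shape, "moment shape does not match its parameter"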
Known caveat: optim_state_dict_to_load can be memory-spiky with use_orig_params=True; consider CPU staging or use_orig_params=False if you hit OOM. (GitHub)
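One way to stage the materialized state on CPU, sketched with FSDP's state-dict configs (whether this is enough to avoid the reported OOMs depends on your model size and host RAM):
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardedOptimStateDictConfig, ShardedStateDictConfig, StateDictType

# Keep state dicts on CPU instead of GPU while they are materialized for save/load.
FSDP.set_state_dict_type(
    model,
    StateDictType.SHARDED_STATE_DICT,
    state_dict_config=ShardedStateDictConfig(offload_to_cpu=True),
    optim_state_dict_config=ShardedOptimStateDictConfig(offload_to_cpu=True),
)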
Why this is the best practice for your case
The report above shows loss spikes when resuming from sharded checkpoints but not from FULL_STATE_DICT. That pattern is consistent with a mismapped or partial optimizer state on restore; the remedy is FSDP's optimizer-state mapping APIs, or DCP to reshard and restore the state correctly.
Minimal Accelerate example
# docs:
# https://huggingface.co/docs/accelerate/en/usage_guides/fsdp
# https://huggingface.co/docs/accelerate/en/package_reference/fsdp
from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin
from torch.distributed.fsdp import StateDictType

plugin = FullyShardedDataParallelPlugin(
    # Sharded model and optimizer shards; PyTorch requires the optimizer state-dict
    # config type to match the state-dict type, so no FullOptimStateDictConfig here.
    state_dict_type=StateDictType.SHARDED_STATE_DICT,
    use_orig_params=True,  # keep identical across the save run and the resume run
)
accelerator = Accelerator(fsdp_plugin=plugin)
# prepare(...) as usual; then accelerator.save_state(...) / load_state(...)
This pattern keeps the sharded checkpoint files fast to write and read, and prevents the resume spike because accelerator.save_state / load_state restore the optimizer state through FSDP's mapping APIs. (Hugging Face)
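A usage sketch of the save/resume loop around that accelerator (resume_dir, save_every, and the checkpoint paths are illustrative names, not Accelerate arguments):
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
if resume_dir is not None:                          # e.g. the value of --resume_from_checkpoint
    accelerator.load_state(resume_dir)              # restores sharded weights and optimizer state
for step, batch in enumerate(dataloader):
    ...                                             # forward / backward / optimizer.step()
    if step % save_every == 0:
        accelerator.save_state(f"ckpts/step_{step}")  # writes per-rank sharded files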
Supplemental reading
PyTorch DCP docs: load-time resharding and per-rank files; useful when the world size changes. (docs.pytorch.org)
FSDP docs: optim_state_dict and optim_state_dict_to_load APIs. Canonical reference for correct optimizer restore. (docs.pytorch.org)
HF Accelerate FSDP guide + API: plugin knobs, state-dict configs, FSDP utilities. Practical knobs for the recipe above. (Hugging Face)
Related issues: reports of resume failures or spikes when optimizer mapping is wrong or topology changes. Confirms the failure mode. (GitHub)