What is the best way to save the state of a model and optimizer when the model has 2 LoRAs?

I have seen that PEFT recommends using model.save_pretrained(), but I am using Accelerate and also want to save the optimizer state, so I wonder if I can just use accelerator.save_state().

this is my (pseudo)code:

lora_config = LoraConfig()

# get a PEFT model, initializing one trainable LoRA adapter
model = get_peft_model(<some_HF_model>, lora_config, adapter_name="lora1")

# add a second trainable LoRA adapter
model.add_adapter("lora2", peft_config=lora_config)

# prepare for training with Accelerate (agent is a custom nn.Module that wraps the PEFT model)
agent, optimizer = accelerator.prepare(agent, optimizer)

Should accelerator.save_state() work now? I get a timeout error when trying.


Since the items being saved are different, it’s generally safer to save both.


Use two saves, every time:

  1. accelerator.save_state(dir) to resume training with the same script. It saves model shards, optimizer, LR scheduler, RNG, and more. Use only for resuming. (Hugging Face)
  2. unwrap_model(model).save_pretrained(...) to export your LoRA adapters. Pass both adapter names and a gathered state dict from Accelerate. This produces portable adapters for reload or sharing. (Hugging Face)

Background and context

  • Why two saves? Accelerate checkpoints are for resumption of the exact run. PEFT’s save_pretrained is for portable adapters independent of your training wrapper. They solve different problems. (Hugging Face)
  • Two LoRAs are fine. PEFT supports multiple LoRA adapters on one base model. You add and name them, then explicitly activate the one(s) you want. Mixing different adapter types needs PeftMixedModel. (Hugging Face)
  • Gathering weights under ZeRO/FSDP. Parameters are sharded. Use accelerator.get_state_dict(model) when exporting so the full adapter weights are materialized correctly. (Hugging Face)

Clear recipe

1) Build, prepare, and train with two LoRAs

# refs:
# accelerate checkpointing: https://huggingface.co/docs/accelerate/en/usage_guides/checkpoint
# peft model API: https://huggingface.co/docs/peft/en/package_reference/peft_model
# fsdp state dict: https://huggingface.co/docs/accelerate/v0.20.3/en/usage_guides/fsdp

from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
import torch

acc = Accelerator()

base = <load transformers model>  # e.g. AutoModelForCausalLM.from_pretrained(...)
cfg = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05, target_modules=["q_proj","v_proj"])

model = get_peft_model(base, cfg, adapter_name="lora1")              # add first LoRA
model.add_adapter("lora2", peft_config=cfg)                           # add second LoRA

# only train adapters
optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-4)

model, optim = acc.prepare(model, optim)

PEFT methods used: add_adapter, set_adapter, save_pretrained. (Hugging Face)
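
If you want to be explicit about which adapter is active, a minimal sketch (adapter names taken from the recipe above; active_adapter is a PeftModel property, so check your PEFT version if it is missing):

peft_model = acc.unwrap_model(model)    # reach the PeftModel under the Accelerate/DeepSpeed wrapper
peft_model.set_adapter("lora1")         # route forward passes through lora1
print(peft_model.active_adapter)        # confirm which adapter is active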

2) Save a resumable checkpoint (model+optimizer+RNG)

acc.wait_for_everyone()                         # sync all ranks  # accelerate docs ↑
acc.save_state("ckpt_step_123")                 # resumable only  # accelerate docs ↑

This produces a checkpoint that acc.load_state("ckpt_step_123") can restore in the same script after you rebuild the same adapters and call prepare. (Hugging Face)

3) Export portable LoRA adapters (both of them)

unwrapped = acc.unwrap_model(model)             # remove DDP/DS/FSDP wrappers  # accelerate docs ↑
unwrapped.save_pretrained(
    "adapters_out",
    selected_adapters=["lora1", "lora2"],      # save both adapters  # peft docs ↑
    state_dict=acc.get_state_dict(model),      # gather from ZeRO/FSDP  # fsdp guide ↑
)

selected_adapters lets you choose which adapters to write. If omitted, PEFT saves all. get_state_dict avoids missing shards with ZeRO-3 or FSDP. (Hugging Face)

4) Resume training later

# Rebuild the SAME adapter names, then prepare, then load:
base = <load transformers model>
model = get_peft_model(base, cfg, adapter_name="lora1")
model.add_adapter("lora2", peft_config=cfg)

# rebuild the optimizer the same way as before
optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-4)

model, optim = acc.prepare(model, optim)
acc.load_state("ckpt_step_123")                 # restores model/optimizer/RNG  # accelerate docs ↑

Names must match your original adapters. (Hugging Face)

5) Use the exported adapters elsewhere

from peft import PeftModel

base = <load transformers model>

# non-"default" adapters are written to their own subdirectories (adapters_out/lora1, adapters_out/lora2)
peft_model = PeftModel.from_pretrained(base, "adapters_out/lora1", adapter_name="lora1")  # attach lora1
peft_model.set_adapter("lora1")                  # activate one adapter
# To switch:
peft_model.load_adapter("adapters_out/lora2", adapter_name="lora2")
peft_model.set_adapter("lora2")

save_pretrained for a multi-adapter model writes each adapter to its own subdir. Loading needs explicit adapter_name and activation. (Hugging Face)
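
To confirm what actually landed on disk, a quick check of the export directory (the per-adapter subdirectory layout is the assumption here, matching the non-"default" names used above):

import os

# expect one subdirectory per named adapter, each with its own adapter_config.json and weights file
for root, _, files in sorted(os.walk("adapters_out")):
    print(root, sorted(files))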

Common pitfalls and fixes

  • Only saving on rank 0 under ZeRO/FSDP. Do not guard save_state() with is_main_process; optimizer state is sharded across ranks, so guarding can deadlock and cause NCCL timeouts (see the sketch after this list). (GitHub)
  • NCCL timeout on save_state. Seen with DeepSpeed stages 1–2. Use fast local storage, call wait_for_everyone(), and avoid wrapper saves that skip ranks. Some reports show process-group timeout kwargs not being applied with DeepSpeed. (Hugging Face Forums)
  • Not unwrapping before exporting. Always call unwrap_model before save_pretrained. It removes the training wrappers without copying weights. (Hugging Face)
  • Forgetting to gather the state dict. Under ZeRO-3/FSDP, call accelerator.get_state_dict(model) before exporting. Without it you can save empty or partial tensors. (Hugging Face)
  • No change when switching adapters. You must set_adapter() to activate the target adapter name. Mismatched or inactive names lead to identical outputs. (Hugging Face)
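
For the first pitfall, a minimal sketch of the wrong and right patterns (directory name is just the example from above):

# WRONG under ZeRO/FSDP: only rank 0 enters save_state, the other ranks never reach
# the collective ops inside it, and the job eventually hits an NCCL timeout.
# if acc.is_main_process:
#     acc.save_state("ckpt_step_123")

# RIGHT: every rank calls save_state; the backend decides which rank writes which shard.
acc.wait_for_everyone()
acc.save_state("ckpt_step_123")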

What each artifact contains

  • accelerator.save_state(dir): model state dict as seen by Accelerate, optimizer state, LR scheduler state, AMP/GradScaler, RNG. Intended for training resumption with the same code path. (Hugging Face)
  • PeftModel.save_pretrained(dir, selected_adapters=[...]): PEFT adapter weights and config for the listed adapter names. Portable, shareable. Load back with PeftModel.from_pretrained(..., adapter_name=...) or load_adapter. (Hugging Face)

Sanity checks before shipping

  • After saving adapters, reload them in a fresh process and run a tiny batch to verify the numerics change when toggling set_adapter("lora1") vs ("lora2"). This catches name or activation mistakes; a sketch follows this list. (GitHub)
  • Under FSDP, if memory is tight when gathering, use the FSDP state-dict context from Accelerate docs to offload to CPU while calling get_state_dict. (Hugging Face)
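
A sketch of that adapter-toggle check, assuming a causal-LM base, its tokenizer, and the export layout from step 3 (the model id is a placeholder; adjust names and inputs to your setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("<base_model_id>")
tok = AutoTokenizer.from_pretrained("<base_model_id>")

m = PeftModel.from_pretrained(base, "adapters_out/lora1", adapter_name="lora1")
m.load_adapter("adapters_out/lora2", adapter_name="lora2")

batch = tok("a tiny smoke-test prompt", return_tensors="pt")
with torch.no_grad():
    m.set_adapter("lora1")
    out1 = m(**batch).logits
    m.set_adapter("lora2")
    out2 = m(**batch).logits

# if the two adapters learned different things, the logits should differ
print("max |delta logits| between adapters:", (out1 - out2).abs().max().item())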

Minimal end-to-end example

# URLs in comments:
# accelerate checkpointing: https://huggingface.co/docs/accelerate/en/usage_guides/checkpoint
# peft PeftModel API (selected_adapters, add_adapter, set_adapter): https://huggingface.co/docs/peft/en/package_reference/peft_model
# accelerate fsdp state dict tips: https://huggingface.co/docs/accelerate/v0.20.3/en/usage_guides/fsdp

from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
import torch

acc = Accelerator()
base = <load>
cfg = LoraConfig(target_modules=["q_proj","v_proj"])

model = get_peft_model(base, cfg, adapter_name="lora1")
model.add_adapter("lora2", peft_config=cfg)

optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-4)
model, optim = acc.prepare(model, optim)

# ... training ...

# A) resumable checkpoint
acc.wait_for_everyone()
acc.save_state("ckpt_step_123")  # accelerate docs ↑

# B) portable adapters
unwrapped = acc.unwrap_model(model)
unwrapped.save_pretrained(
    "adapters_out",
    selected_adapters=["lora1","lora2"],       # peft docs ↑
    state_dict=acc.get_state_dict(model),      # fsdp/zero safe ↑
)

Supplemental materials

Core docs

  • Accelerate checkpointing: save_state and load_state. Clear scope and expectations. (Hugging Face)
  • Accelerate API reference: unwrap_model, register_for_checkpointing, and get_state_dict. (Hugging Face)
  • FSDP state-dict guide in Accelerate: how get_state_dict interacts with FSDP and how to offload. (Hugging Face)
  • PEFT PeftModel reference: add_adapter, set_adapter, save_pretrained(selected_adapters=...). (Hugging Face)

Good issue threads for edge cases

  • Multi-adapter save vs load behavior and expectations. (GitHub)
  • Example patterns using unwrap_model(...).save_pretrained(..., state_dict=accelerator.get_state_dict(...)). (GitHub)
  • save_state() timeouts and DeepSpeed notes. (Hugging Face Forums)
  • Early discussion on saving model+optimizer+LR+RNG expectations. (GitHub)

Quick references

  • PyTorch tutorial on saving model and optimizer state dicts. Useful if you drop Accelerate. (PyTorch Documentation)
  • FSDP optimizer state load helpers if you roll your own FSDP without Accelerate. (PyTorch Documentation)

Bottom line

  • Do both saves. Use accelerator.save_state() for resume. Use unwrap_model(model).save_pretrained(..., selected_adapters=[...], state_dict=accelerator.get_state_dict(model)) for the two LoRAs. This separation avoids deadlocks, preserves optimizer state, and gives you portable adapters. (Hugging Face)

Thank you so much for your detailed response! :slight_smile: @John6666

I was able to save the training state with accelerate without timeouts! … but I am now facing issues when trying to load the state back and continue training.

this is what I am doing:

# agent is a custom class which uses a VLM model (i.e. agent.vlm is a HF model)
agent, optimizer = accelerator.prepare(agent, optimizer)

# debug:
for name, param in agent.named_parameters():
    print(name)
# module.vlm.model.base_model.model.model.language_model.layers.35.self_attn.o_proj.lora_A.lora_1.weight
# module.vlm.model.base_model.model.model.language_model.layers.35.self_attn.o_proj.lora_A.lora_2.weight

but when I do:

accelerator.load_state(checkpoint_path)

I get:

[2025-10-30 16:25:58,412] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from <model_path>/mp_rank_00_model_states.pt...
[2025-10-30 16:26:27,156] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from <model_path>/mp_rank_00_model_states.pt.
[2025-10-30 16:26:28,022] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from <model_path>/mp_rank_00_model_states.pt.
[2025-10-30 16:26:28,158] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from <model_path>/mp_rank_00_model_states.pt.
[2025-10-30 16:26:28,324] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from <model_path>/mp_rank_00_model_states.pt...
[2025-10-30 16:26:28,977] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from <model_path>/mp_rank_00_model_states.pt...
[2025-10-30 16:26:29,302] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from <model_path>/mp_rank_00_model_states.pt...
[2025-10-30 16:26:29,344] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from <model_path>/mp_rank_00_model_states.pt.
[2025-10-30 16:26:30,254] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from <model_path>/mp_rank_00_model_states.pt...
[2025-10-30 16:26:57,833] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from <model_path>/mp_rank_00_model_states.pt.
[rank2]: Traceback (most recent call last):
[rank2]:   File "script.py", line 143, in <module>
[rank2]:     accelerator.load_state(args.checkpoint_dir)
[rank2]:   File "python/lib/python3.10/site-packages/accelerate/accelerator.py", line 3089, in load_state
[rank2]:     model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank2]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2806, in load_checkpoint
[rank2]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank2]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2889, in _load_checkpoint
[rank2]:     self.load_module_state_dict(checkpoint=checkpoint,
[rank2]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2681, in load_module_state_dict
[rank2]:     param.data.copy_(saved_frozen_params[name].data)
[rank2]: KeyError: 'vlm.model.base_model.model.model.visual.blocks.0.attn.qkv.lora_A.lora_2.weight'
[2025-10-30 16:26:59,399] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from <model_path>/mp_rank_00_model_states.pt.
Traceback (most recent call last):
  File "script.py", line 143, in <module>
    accelerator.load_state(args.checkpoint_dir)
  File "python/lib/python3.10/site-packages/accelerate/accelerator.py", line 3089, in load_state
    model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
  File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2806, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2889, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2681, in load_module_state_dict
    param.data.copy_(saved_frozen_params[name].data)
KeyError: 'vlm.model.base_model.model.model.visual.blocks.0.attn.qkv.lora_A.lora_2.weight'
[rank0]: Traceback (most recent call last):
[rank0]:   File "script.py", line 143, in <module>
[rank0]:     accelerator.load_state(args.checkpoint_dir)
[rank0]:   File "python/lib/python3.10/site-packages/accelerate/accelerator.py", line 3089, in load_state
[rank0]:     model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank0]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2806, in load_checkpoint
[rank0]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank0]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2889, in _load_checkpoint
[rank0]:     self.load_module_state_dict(checkpoint=checkpoint,
[rank0]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2681, in load_module_state_dict
[rank0]:     param.data.copy_(saved_frozen_params[name].data)
[rank0]: KeyError: 'vlm.model.base_model.model.model.visual.blocks.0.attn.qkv.lora_A.lora_2.weight'
[2025-10-30 16:27:01,047] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from <model_path>/mp_rank_00_model_states.pt.
W1030 16:27:01.766000 2884113 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2884289 closing signal SIGTERM
W1030 16:27:01.768000 2884113 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2884290 closing signal SIGTERM
W1030 16:27:01.768000 2884113 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2884292 closing signal SIGTERM
[rank1]: Traceback (most recent call last):
[rank1]:   File "script.py", line 143, in <module>
[rank1]:     accelerator.load_state(args.checkpoint_dir)
[rank1]:   File "python/lib/python3.10/site-packages/accelerate/accelerator.py", line 3089, in load_state
[rank1]:     model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank1]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2806, in load_checkpoint
[rank1]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank1]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2889, in _load_checkpoint
[rank1]:     self.load_module_state_dict(checkpoint=checkpoint,
[rank1]:   File "python/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2681, in load_module_state_dict
[rank1]:     param.data.copy_(saved_frozen_params[name].data)
[rank1]: KeyError: 'vlm.model.base_model.model.model.visual.blocks.0.attn.qkv.lora_A.lora_2.weight'

agent is a custom class but it still inherits from nn.Module so I thought it should be fine.

It seems both the model instance and the checkpoint have the correct parameters but saved under different names.


When using multi-GPU setups, you often have to adjust training to match the behavior of backends like DeepSpeed, which leads to a significant increase in unfamiliar errors…:sweat_smile:
Workarounds may or may not exist, so if the errors prove hard to avoid, switching to a different backend can sometimes be the simpler fix.


You are loading into a different module graph than you saved. DeepSpeed is trying to restore a frozen parameter named
vlm.model.base_model.model.model.visual.blocks.0.attn.qkv.lora_A.lora_2.weight,
but your rebuilt model only has LoRA params under the language stack (...language_model...o_proj...lora_1/2). That mismatch triggers a saved_frozen_params[...] KeyError during engine.load_module_state_dict. The fix is to rebuild the exact adapters, names, and freeze mask before wrapping with Accelerate/DeepSpeed, then load. (GitHub)

What is going wrong

  • Adapter topology drift. The checkpoint expects a vision-tower LoRA on visual.blocks.*.attn.qkv named lora_2, but your runtime model shows only language-side LoRAs. Names and target modules must be identical between save and load; see the sketch after this list. (Hugging Face)
  • Frozen vs trainable drift. DeepSpeed saves “frozen” tensors separately and restores them from saved_frozen_params. If a tensor that was frozen at save time isn’t present under the same name at load time, you hit this KeyError. Changing freeze config or what you attach LoRAs to will do it. (deepspeed.readthedocs.io)
  • Wrapper timing. Create all adapters before accelerator.prepare(...) so parameter registration and names match what DeepSpeed saved. This has been a known source of resume issues with LoRA+ZeRO. (GitHub)
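
To spot topology drift quickly, a small sketch that groups LoRA parameter names by adapter and tower before prepare (the string parsing assumes the key pattern from your traceback, e.g. ...attn.qkv.lora_A.lora_2.weight):

from collections import defaultdict

modules_by_adapter = defaultdict(set)
for name, _ in model.named_parameters():
    if ".lora_A." in name or ".lora_B." in name:
        adapter = name.rsplit(".", 2)[1]                       # e.g. "lora_1" or "lora_2"
        tower = "visual" if ".visual." in name else "language"
        modules_by_adapter[(adapter, tower)].add(name.split(".lora_")[0])

for key, mods in sorted(modules_by_adapter.items()):
    print(key, f"{len(mods)} target modules")
# the checkpoint expects a ("lora_2", "visual") entry; if it is missing here, that is the drift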

Correct, strict resume recipe

  1. Rebuild the same adapters with the same names and target modules for both language and vision parts, then reapply the same requires_grad mask. Do this before prepare.
  2. Wrap with Accelerate, then load state with strict key checking.
  3. If loading still fails, diff keys from the checkpoint and your model to find the missing or extra adapters.
# URLs in comments for each API:
# - Accelerate load_state + kwargs: https://huggingface.co/docs/accelerate/en/package_reference/accelerator  # load_model_func_kwargs
# - PEFT multi-adapter + naming: https://huggingface.co/docs/transformers/main/en/peft
# - DeepSpeed load_checkpoint flags: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html

from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
import torch

accelerator = Accelerator()

# 1) Rebuild EXACT topology
txt_cfg = LoraConfig( # example; match what you trained
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj"]
)
vis_cfg = LoraConfig( # only if you trained vision LoRA
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["qkv"]  # ViT-style fused qkv
)

base = <load your VLM backbone>

model = get_peft_model(base, txt_cfg, adapter_name="lora_1")
model.add_adapter("lora_2", peft_config=txt_cfg)

# If your checkpoint included vision adapters, attach them again with the SAME names
# e.g., using the same adapter names or a dedicated vision-only LoRA setup if that's what you saved

# Reapply the SAME freeze mask you used when saving (e.g., only LoRA trainable)
for n, p in model.named_parameters():
    p.requires_grad = ("lora_" in n)

# 2) Wrap and load
agent = MyAgent(vlm=model)  # your custom wrapper module
optimizer = torch.optim.AdamW((p for p in agent.parameters() if p.requires_grad), lr=2e-4)
agent, optimizer = accelerator.prepare(agent, optimizer)

# extra keyword arguments are collected as **load_model_func_kwargs and forwarded
# to the backend loader (DeepSpeed's engine.load_checkpoint)
accelerator.load_state(
    "<checkpoint_dir>",
    load_module_strict=True,  # DeepSpeed's strict flag
)
  • Extra keyword arguments to load_state are collected as load_model_func_kwargs and forwarded by Accelerate to the backend (engine.load_checkpoint). Use load_module_strict=True for a faithful resume. (Hugging Face)
  • PEFT requires explicit adapter names and activation. Multi-adapter management is documented here. (Hugging Face)

If it still errors: find the mismatch fast

Extract an FP32 state dict from the ZeRO checkpoint and diff keys.
This tells you exactly which LoRA names or submodules the checkpoint contains vs your model.

# DeepSpeed utility: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#get-fp32-weights
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

sd = get_fp32_state_dict_from_zero_checkpoint("<checkpoint_dir>", exclude_frozen_parameters=False)
keys_ckpt = set(sd.keys())
# after prepare, the wrapped model's keys usually carry a leading "module." prefix; strip it so the diff lines up
keys_model = {k.removeprefix("module.") for k in agent.state_dict().keys()}

missing = [k for k in keys_model if k not in keys_ckpt and "lora_" in k]
extra   = [k for k in keys_ckpt if k not in keys_model and "lora_" in k]

print("Missing in checkpoint:", missing[:20])
print("Extra in checkpoint:", extra[:20])
  • The helper reconstructs a single state_dict for inspection. Set exclude_frozen_parameters=False so frozen LoRA weights appear in the dump. (deepspeed.readthedocs.io)

Make future resumes robust

  • Save with frozen included. When saving with Accelerate, pass through the DeepSpeed flag so frozen weights are kept in the checkpoint. This avoids surprises if you change the freeze mask later.

    accelerator.save_state(
        "<out_dir>",
        exclude_frozen_parameters=False,  # collected as **save_model_func_kwargs and forwarded to DeepSpeed's save_checkpoint
    )
    

    DeepSpeed documents the exclude_frozen_parameters knob for checkpointing. (deepspeed.readthedocs.io)

  • Consistent creation order and names. Always attach all adapters with explicit adapter_name values before wrapping, and avoid renames like lora1 vs lora_1 across runs. PEFT stores the names in parameter keys (e.g., ...lora_A.lora_2.weight). A pre-flight check is sketched after this list. (Hugging Face)

  • Strict first, non-strict only as a stopgap. If you must move forward, you can relax:

    accelerator.load_state("<checkpoint_dir>", load_module_strict=False)  # forwarded to engine.load_checkpoint
    

    Use this only to unblock. It skips mismatched keys rather than fixing them. The flag is part of load_checkpoint. (deepspeed.readthedocs.io)
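
As a pre-flight check before prepare, a minimal sketch (assumes the rebuilt PeftModel; peft_config is a dict keyed by adapter name):

# the adapter names registered on the rebuilt model must exactly match what was saved
expected = {"lora_1", "lora_2"}
found = set(model.peft_config.keys())
assert found == expected, f"adapter name drift: found {sorted(found)}, expected {sorted(expected)}"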

Quick validations you can run

  • Print LoRA keys pre-load. Confirm you actually rebuilt the vision LoRA if your checkpoint expects it:

    for n, _ in agent.named_parameters():
        if "lora_" in n and ("visual" in n or "language_model" in n):
            print(n)
    

    Your error explicitly references visual.blocks.0.attn.qkv...lora_2, so those keys must exist in your model before load_state. If they don’t, attach the vision adapters with the same target_modules and names. (Hugging Face)

  • Version pin. Resume behavior changes across Accelerate and DeepSpeed versions. If you recently upgraded, confirm against the docs for your version and known issues around resume; a version printout is sketched below. (Hugging Face)
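
A quick way to record the versions in play (the deepspeed import assumes you are on the DeepSpeed backend, as in this thread):

import accelerate, deepspeed, peft, torch, transformers

for mod in (torch, transformers, accelerate, deepspeed, peft):
    print(mod.__name__, mod.__version__)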

Similar cases for reference

  • Exact saved_frozen_params[...] KeyError on vision tower while resuming with ZeRO-3. Same call path as yours. Resolution: align frozen mask and module keys. (GitHub)
  • PEFT + ZeRO-3 resume “missing keys”. Root causes: inconsistent freeze settings or adapter setup across runs. (GitHub)
  • Historical LoRA+ZeRO-3 init-order problems. Motivation to attach adapters before wrapping. (GitHub)

Bottom line

Rebuild the same adapters (vision and language) with the same names and same target modules, reapply the same freeze mask, then call accelerator.load_state(...) (strict). If it fails, extract FP32 weights and diff to locate the missing LoRA keys. Save future checkpoints with exclude_frozen_parameters=False to keep frozen LoRA params. These steps align with the Accelerate loader contract and DeepSpeed checkpoint semantics. (Hugging Face)


This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.