Transformers Trainer + Accelerate FSDP: How do I load my model from a checkpoint?

Hi all,

I’ve fine-tuned a Llama2 model using the transformers Trainer class, plus accelerate and FSDP, with a sharded state dict. Now my checkpoint directories all have the model’s state dict sharded across multiple .distcp files; how do I open them, or convert them to a format I can open with .from_pretrained()? I’ve not found documentation on this anywhere. Any help would be greatly appreciated!

Thank you so much!


So I ran into the same issue, and after many Google searches I found that llama-recipes has a pull request that fixed this and provides a doc/script for it. I haven't tested it yet, but it seems helpful.
Specifically, the command should be:

```
python -m llama_recipes.inference.checkpoint_converter_fsdp_hf \
    --fsdp_checkpoint_path PATH/to/FSDP/Checkpoints \
    --consolidated_model_path PATH/to/save/checkpoints \
    --HF_model_path_or_name PATH/or/HF/model_name
```
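If you'd rather not pull in llama-recipes, recent PyTorch versions also ship their own utility for consolidating `.distcp` shards. Below is a minimal sketch of that route, assuming torch >= 2.2 (where `torch.distributed.checkpoint.format_utils` is available). All paths and the base model name are placeholders for your own values, and the key the Trainer nests the weights under (`"model"` below) is an assumption, so inspect the loaded dict if it differs:

```python
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
from transformers import AutoModelForCausalLM

FSDP_CKPT_DIR = "PATH/to/FSDP/Checkpoints"   # directory containing the .distcp shards
CONSOLIDATED_PATH = "consolidated.pth"        # temporary single-file checkpoint
HF_MODEL_NAME = "meta-llama/Llama-2-7b-hf"    # placeholder: base model you fine-tuned from
OUTPUT_DIR = "PATH/to/save/checkpoints"       # where the HF-format model will be written

# 1) Consolidate the sharded .distcp files into one torch.save file.
dcp_to_torch_save(FSDP_CKPT_DIR, CONSOLIDATED_PATH)

# 2) Load the consolidated state dict. Depending on how the Trainer saved it,
#    the model weights may be nested under a key such as "model".
state_dict = torch.load(CONSOLIDATED_PATH, map_location="cpu")
if "model" in state_dict:
    state_dict = state_dict["model"]

# 3) Load the weights into the base architecture and save in HF format,
#    so .from_pretrained() works on OUTPUT_DIR afterwards.
model = AutoModelForCausalLM.from_pretrained(HF_MODEL_NAME, torch_dtype=torch.bfloat16)
model.load_state_dict(state_dict)
model.save_pretrained(OUTPUT_DIR)
```

After this, `AutoModelForCausalLM.from_pretrained(OUTPUT_DIR)` should load the model as usual. Judging by its flags, the llama-recipes script above appears to do essentially the same consolidate-then-save_pretrained conversion, so use whichever route is less friction.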