Multinode FSDP not working

I have a setup where I need to load two 7B Llama models (one reward model and one SFT model). By default, when I use accelerate (whether launched via `accelerate config`/`accelerate launch` or via torchrun), it tries to load both models fully on the same node despite having multiple nodes available, and eventually crashes with an out-of-memory error.

I wanted to know whether it's possible to shard even a single model across multiple nodes using FSDP. If that's not possible, how do I go about loading multiple models in a multinode setting with FSDP so that it does not run out of memory? Any ideas would be appreciated.
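For reference, this is roughly the multinode FSDP config I've been trying with accelerate (a sketch only: the IPs, process counts, and layer class name are placeholders for my two-node setup, and some field names may differ between accelerate versions):

```yaml
# accelerate FSDP config sketch for a 2-node run (values are placeholders)
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 2            # two nodes
num_processes: 16          # total GPUs across both nodes (8 per node here)
machine_rank: 0            # set to 1 on the second node
main_process_ip: 10.0.0.1  # placeholder: rank-0 node's address
main_process_port: 29500
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

Each node then runs `accelerate launch --config_file <this file> train.py` with its own `machine_rank`. With this config I'd expect FULL_SHARD to shard parameters across all 16 ranks, but the initial model loading still happens per node before sharding kicks in, which seems to be where the OOM occurs.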