FSDP FULL_SHARD: 3 GPUs work, 2 GPUs hang at the first step

Custom SDXL training script, 3x RTX 4090s.

FSDP SHARD_GRAD_OP:
Works fine with both 2 and 3 GPUs.

FSDP FULL_SHARD:
Works fine with 3 GPUs.

With 2 GPUs:
Hangs at accelerator.backward(loss); the first step never completes.
~13 GB of 24 GB VRAM is in use, which seems correct.
Tried various pairs of 2 GPUs, no change.
Tried with and without fsdp_offload_params.
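
For context, here is roughly how the FSDP side of the script is set up. This is a simplified sketch, not the actual code (the real run goes through accelerate launch with a config file, and the wrap policy / mixed precision details differ), just to show which knobs I'm touching:

```python
import functools

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Rough equivalent of my FSDP settings (simplified; wrap policy and dtype
# handling in the real script are different).
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # also tested SHARD_GRAD_OP / NO_SHARD
    cpu_offload=CPUOffload(offload_params=False),   # tried both True and False
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")
# Model, optimizer, and dataloader are prepared via accelerator.prepare(...),
# then in the training loop:
#     loss = ...
#     accelerator.backward(loss)  # <- this is where the 2-GPU FULL_SHARD run hangs
```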

How should I go about troubleshooting this?
All libraries are on their latest versions; I updated everything right before testing the script.
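
If it helps, I can rerun the hanging 2-GPU FULL_SHARD case with extra NCCL / torch.distributed logging. I'm assuming something like this, set before the Accelerator is created (or exported in the shell), is a reasonable starting point:

```python
import os

# Assumed debug settings; set before the process group / Accelerator is initialized.
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose per-rank NCCL logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra collective consistency checks
os.environ["NCCL_P2P_DISABLE"] = "1"              # rule out GPU peer-to-peer transport
```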

Update:
1 GPU with NO_SHARD & offload_params works.
2 GPUs with NO_SHARD & offload_params works.

Everything seems to work except FULL_SHARD with 2 GPUs.
No idea what to make of this.