Custom SDXL training script, 3x RTX 4090s.
FSDP SHARD_GRAD_OP:
Works fine with both 2 and 3 GPUs.
FSDP FULL_SHARD:
Works fine with 3 GPUs.
With 2 GPUs it hangs at accelerator.backward(loss) and never completes the first step.
~13 GB of the 24 GB VRAM in use, which seems about right.
Tried various pairs of 2 GPUs, no change.
Tried with and without fsdp_offload_params (rough sketch of the FSDP options below).
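For reference, the options in question map onto accelerate's FullyShardedDataParallelPlugin roughly like this. This is a simplified sketch, not my actual script; kwarg names follow recent accelerate versions, and the job is launched with accelerate launch:

```python
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Sharding strategy under test: FULL_SHARD hangs on 2 GPUs, SHARD_GRAD_OP and
# NO_SHARD are fine. cpu_offload corresponds to the fsdp_offload_params toggle.
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

# Model, optimizer and dataloader are wrapped with accelerator.prepare(...),
# and the hang happens inside accelerator.backward(loss) on the first step.
```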
How should I go about troubleshooting this?
All libraries are up to date; I updated everything just before testing the script.
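What I'm planning to try first is turning up the distributed/NCCL logging and ruling out peer-to-peer, since 4090s reportedly don't support P2P transfers. These are standard PyTorch/NCCL environment variables; disabling P2P is a guess at the cause, not a confirmed fix:

```python
# Set before accelerate / torch.distributed initialize, i.e. at the very top of
# the training script (or exported in the shell before `accelerate launch`).
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL init + transport logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # checks for mismatched collectives
os.environ.setdefault("TORCH_SHOW_CPP_STACKTRACES", "1")    # C++ stacks on distributed errors
# 4090s have no P2P; force NCCL onto shared-memory/PCIe copies in case the P2P
# path is what's wedging. Guess, not a confirmed fix.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
```

While it's hung I can also attach py-spy (py-spy dump --pid <pid> for each rank) to see which call each process is blocked in.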
Update:
1 GPU with NO_SHARD & offload_params works.
2 GPUs with NO_SHARD & offload_params works.
Everything seems to work except 2-GPU FULL_SHARD.
No idea what to make of this.
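Next I'll probably take SDXL/FSDP out of the picture and check the raw collectives FULL_SHARD leans on (all-gather plus reduce-scatter) between a pair of cards with a throwaway script, something like the following (file name and tensor sizes are arbitrary):

```python
# nccl_pair_test.py - minimal 2-GPU collective smoke test, independent of SDXL/FSDP.
# Run: torchrun --nproc_per_node=2 nccl_pair_test.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
world = dist.get_world_size()

# FULL_SHARD's backward is built on all-gather (params) and reduce-scatter (grads),
# so exercise both directly.
shard_in = torch.ones(1 << 20, device="cuda")
gathered = torch.empty(world * shard_in.numel(), device="cuda")
dist.all_gather_into_tensor(gathered, shard_in)

shard_out = torch.empty_like(shard_in)
dist.reduce_scatter_tensor(shard_out, gathered)

torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: all_gather + reduce_scatter ok")
dist.destroy_process_group()
```

This will probably pass given that SHARD_GRAD_OP already works on 2 GPUs, but it at least rules out the basic collectives between a given pair of cards.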