FSDP FULL_SHARD: 3 GPUs work, 2 GPUs hang at the first step

Custom SDXL training script, 3x RTX 4090s.

FSDP SHARD_GRAD_OP:
Works fine with both 2 and 3 GPUs.

FSDP FULL_SHARD:
Works fine with 3 GPUs.

With 2 GPUs:
Hangs at accelerator.backward(loss); the first step never completes.
~13 GB of 24 GB VRAM is in use, which seems correct.
Tried various pairs of 2 GPUs, no change.
Tried with and without fsdp_offload_params.
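
For context, here is roughly how the FSDP side of the script is set up. This is a simplified sketch, not the actual code (the real run goes through accelerate launch with a config file, and the wrap policy / mixed precision details differ), just to show which knobs I'm touching:

```python
import functools

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Rough equivalent of my FSDP settings (simplified; wrap policy and dtype
# handling in the real script are different).
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # also tested SHARD_GRAD_OP / NO_SHARD
    cpu_offload=CPUOffload(offload_params=False),   # tried both True and False
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")
# Model, optimizer, and dataloader are prepared via accelerator.prepare(...),
# then in the training loop:
#     loss = ...
#     accelerator.backward(loss)  # <- this is where the 2-GPU FULL_SHARD run hangs
```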

How should I go about troubleshooting this?
All libraries are on their latest versions; I updated everything right before testing the script.
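
If it helps, I can rerun the hanging 2-GPU FULL_SHARD case with extra NCCL / torch.distributed logging. I'm assuming something like this, set before the Accelerator is created (or exported in the shell), is a reasonable starting point:

```python
import os

# Assumed debug settings; set before the process group / Accelerator is initialized.
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose per-rank NCCL logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra collective consistency checks
os.environ["NCCL_P2P_DISABLE"] = "1"              # rule out GPU peer-to-peer transport
```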

Update:
1 GPU with NO_SHARD & offload_params works.
2 GPUs with NO_SHARD & offload_params works.

Everything seems to work except FULL_SHARD with 2 GPUs.
No idea what to make of this.