How effective is FSDP with Accelerate?

I have been using Accelerate with a custom model that wraps BART-large on a shared machine with two 48GB GPUs. The relevant code looks like this:

from torch import nn
from transformers import BartForConditionalGeneration
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq2seq = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

The code works but runs very slowly, because the GPUs are shared with other users and nvidia-smi shows them constantly at 100% utilization.
So I moved the job to two 32GB GPUs with exclusive access. Training ran for around 1000 iterations before hitting OOM. I am now configuring FSDP with Accelerate.
Here is the YAML configuration file:

deepspeed_config: {}
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_cpu_offload: true
  fsdp_rank0_only: true
  fsdp_use_orig_params: true
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: "BartEncoderLayer,BartDecoderLayer"
machine_rank: 0
main_process_port: 9697
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false

I also followed the suggestion (Fully Sharded Data Parallel) to prepare the model before preparing the optimizer. But training only lasted a few dozen iterations before throwing OOM:

  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/autograd/", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.78 GiB. GPU 0 has a total capacity of 31.74 GiB of which 5.26 GiB is free. Including non-PyTorch memory, this process has 26.48 GiB memory in use. Of the allocated memory 24.73 GiB is allocated by PyTorch, and 788.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (
[2024-01-30 17:33:41,582] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4114313 closing signal SIGTERM
[2024-01-30 17:33:44,404] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4114312) of binary: /shared/homes/my_home/dev/pyvenv_pt21/bin/python3.8
Traceback (most recent call last):
  File "/shared/homes/my_home/dev/pyvenv_pt21/bin/accelerate", line 8, in <module>
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/accelerate/commands/", line 45, in main
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/accelerate/commands/", line 902, in launch_command
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/accelerate/commands/", line 599, in multi_gpu_launcher
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/distributed/", line 803, in run
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/distributed/launcher/", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/distributed/launcher/", line 268, in launch_agent
    raise ChildFailedError(
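
As a rough sanity check on where the memory might be going, here is a back-of-the-envelope estimate of the model state (weights, gradients, optimizer moments) per GPU. It is only a sketch under stated assumptions: about 406M parameters for BART-large (an approximate figure, not measured), fp32 everywhere (mixed_precision is 'no' in the config above), an Adam-style optimizer with two moments per parameter, and no CPU offload. The helper name model_state_gib is made up for illustration, and any extra layers in the extended model are not counted.

```python
GIB = 1024 ** 3

def model_state_gib(num_params: int, num_gpus: int) -> float:
    """Estimate per-GPU GiB for fp32 weights, gradients and Adam moments when fully sharded."""
    bytes_per_param = 4 + 4 + 8  # fp32 weight + fp32 gradient + two fp32 Adam moments
    return num_params * bytes_per_param / num_gpus / GIB

# Assumption: roughly 406M parameters for BART-large (approximate, not measured here).
NUM_PARAMS = 406_000_000

unsharded = model_state_gib(NUM_PARAMS, num_gpus=1)
sharded = model_state_gib(NUM_PARAMS, num_gpus=2)
print(f"model state, unsharded: {unsharded:.2f} GiB")
print(f"model state, FULL_SHARD on 2 GPUs: {sharded:.2f} GiB per GPU")
```

Under these assumptions the model state is about 6 GiB in total, so about 3 GiB per GPU when sharded. That is smaller than the single 5.78 GiB allocation that failed during backward, which may point toward activations (batch size, sequence length) rather than parameters as the main consumer, though this has not been profiled.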

So I am wondering how effective FSDP in Accelerate is for a large model with a large dataset. Or it may just be that I haven't configured it efficiently. Could anyone shed some light?