How effective is FSDP with Accelerate?

chriss2023 · January 30, 2024, 7:05am

I have been using Accelerate to train an extended model that contains BART-large on two shared 48 GB GPUs. The relevant code segment looks like this:

import torch.nn as nn
from transformers import BartForConditionalGeneration

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq2seq = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
    ...

The code works but runs very slowly, because the GPUs are shared with other users and sit at 100% utilization according to nvidia-smi.
So I moved the job to two 32 GB GPUs with exclusive use. Training ran for about 1000 iterations before hitting OOM. I am now configuring FSDP with Accelerate.
Here is the configuration yaml file:

deepspeed_config: {}
distributed_type: FSDP
fsdp_config: {
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP,
  fsdp_backward_prefetch_policy: BACKWARD_PRE,
  fsdp_forward_prefetch: false,
  fsdp_cpu_ram_efficient_loading: true,
  fsdp_offload_params: true,
  fsdp_sharding_strategy: FULL_SHARD,
  fsdp_state_dict_type: FULL_STATE_DICT,
  fsdp_cpu_offload: true,
  fsdp_rank0_only: true,
  fsdp_use_orig_params: true,
  fsdp_sync_module_states: true,
  fsdp_transformer_layer_cls_to_wrap: "BartEncoderLayer,BartDecoderLayer"
}
machine_rank: 0
main_process_ip: 127.0.0.1
main_process_port: 9697
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false

I also followed the suggestion (Fully Sharded Data Parallel) to prepare the model before preparing the optimizer. But training only lasted a few dozen iterations before throwing OOM:

  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.78 GiB. GPU 0 has a total capacity of 31.74 GiB of which 5.26 GiB is free. Including non-PyTorch memory, this process has 26.48 GiB memory in use. Of the allocated memory 24.73 GiB is allocated by PyTorch, and 788.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-01-30 17:33:41,582] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4114313 closing signal SIGTERM
[2024-01-30 17:33:44,404] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4114312) of binary: /shared/homes/my_home/dev/pyvenv_pt21/bin/python3.8
Traceback (most recent call last):
  File "/shared/homes/my_home/dev/pyvenv_pt21/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/accelerate/commands/launch.py", line 902, in launch_command
    multi_gpu_launcher(args)
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/shared/homes/my_home/dev/pyvenv_pt21/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
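
For readers unfamiliar with the ordering mentioned above (prepare the model first, then build and prepare the optimizer), here is a minimal sketch. It is not my actual training script: the function name `prepare_for_fsdp`, the choice of AdamW, and the learning rate are placeholders for illustration only.

```python
def prepare_for_fsdp(model, dataloader, lr=1e-5):
    """Wrap the model with FSDP first, then create the optimizer on the wrapped model."""
    # Local imports so the sketch can be imported without torch/accelerate installed.
    import torch
    from accelerate import Accelerator

    # Picks up the FSDP settings from the config used by `accelerate launch`.
    accelerator = Accelerator()

    # Step 1: prepare (shard/wrap) the model on its own.
    model = accelerator.prepare(model)

    # Step 2: build the optimizer from the wrapped model's parameters.
    # AdamW and lr are placeholder choices.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Step 3: prepare the optimizer and dataloader afterwards.
    optimizer, dataloader = accelerator.prepare(optimizer, dataloader)
    return accelerator, model, optimizer, dataloader
```

The point of the ordering is that the optimizer is constructed from the parameters of the already wrapped model rather than from the original, unsharded ones.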

So I am wondering how effective FSDP in Accelerate is for a large model with a large dataset. Or perhaps I simply haven't configured it efficiently. Could anyone shed some light?
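
As a rough sanity check on how much FSDP can save in this setup, here is a back-of-the-envelope estimate of the model-state memory (weights, gradients, Adam moments) with and without sharding. The parameter count of about 406M for facebook/bart-large-cnn is an assumption, the extra layers of my extended model are ignored, and fp32 with an Adam-style optimizer is assumed (the config uses mixed_precision: 'no'). Activations are not included.

```python
# Back-of-the-envelope estimate of model-state memory per GPU (activations excluded).
PARAMS = 406_000_000   # ASSUMPTION: approximate parameter count of facebook/bart-large-cnn
BYTES_FP32 = 4         # mixed_precision is 'no', so everything is fp32
NUM_GPUS = 2           # num_processes in the config


def model_state_gib(num_params, num_gpus=1):
    """GiB per GPU for weights + gradients + two Adam moment buffers, evenly sharded."""
    weights = num_params * BYTES_FP32
    grads = num_params * BYTES_FP32
    adam_moments = 2 * num_params * BYTES_FP32  # ASSUMPTION: Adam-style optimizer
    return (weights + grads + adam_moments) / num_gpus / 2**30


unsharded = model_state_gib(PARAMS)
sharded = model_state_gib(PARAMS, NUM_GPUS)
print(f"model states, no sharding: {unsharded:.2f} GiB per GPU")
print(f"model states, FULL_SHARD on {NUM_GPUS} GPUs: {sharded:.2f} GiB per GPU")
```

Under these assumptions the model states come to roughly 6 GiB unsharded and about half of that per GPU when sharded, which is small next to the 24.73 GiB allocated at the time of the error. That suggests, though it does not prove, that most of the memory is going to activations (batch size and sequence length), which sharding the model states does not reduce.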

Regards
