Issues with Dataset Loading and Checkpoint Saving using FSDP with HuggingFace Trainer on SLURM Multi-Node Setup

Hello,

I’m training a model using a SLURM multi-node setup with accelerate, HuggingFace Trainer, FlashAttention2, and FSDP. I’m working on a shared filesystem, which is mounted on all nodes so that every process can access the same data and output directories.

I launch my training script without explicitly importing the accelerate Python package in code. Instead, I run it like this:

accelerate launch --config_file my_config.yaml train_my_model.py

I have two questions:


1. Dataset Loading
In my training script, I load the dataset like this:

from datasets import load_dataset

ds = load_dataset(ds_name, split='train')

Do I need to make sure this is executed only by the main process (e.g., using main_process_first() or a file lock) to prevent conflicts between processes in a distributed setting?
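
For concreteness, here is the pattern I have in mind: a minimal sketch using the main_process_first() context manager on TrainingArguments (the output_dir and other arguments are placeholders, not my real settings):

from datasets import load_dataset
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="/shared/output")  # placeholder; real arguments omitted

# Only the main process loads/prepares the dataset; the other ranks wait at a
# barrier and then read the cached copy from the shared filesystem.
with training_args.main_process_first(local=False, desc="dataset loading"):
    # local=False -> only the global rank-0 process runs the body first, which I
    # believe is what I'd want with a filesystem shared across all nodes.
    ds = load_dataset(ds_name, split="train")  # ds_name as in the snippet above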

Also, given this setup, is it safe to freely use .map() on the dataset inside the training script, or do I need to handle synchronization or process-specific behavior there as well?
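
To make the second part concrete, this is the kind of unsynchronized .map() call I would like to use if it is safe, again just a sketch with a placeholder preprocessing function:

def preprocess(batch):
    # placeholder for my real tokenization / preprocessing
    return batch

# Called as-is on every rank. My understanding is that the datasets fingerprint
# cache on the shared filesystem lets later ranks reuse the result computed by
# the first one, but I would like to confirm that no extra synchronization is needed.
ds = ds.map(
    preprocess,
    batched=True,
    load_from_cache_file=True,
)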


2. FSDP Checkpoint Saving
I’m using a shared filesystem, and I get the following error during the checkpoint-saving step. I suspect it’s related to concurrent writes in a multi-node setup with FSDP.

1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2241, in train
1: [rank9]:     return inner_training_loop(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2612, in _inner_training_loop
1: [rank9]:     self._maybe_log_save_evaluate(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
1: [rank9]:     self._save_checkpoint(model, trial)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3194, in _save_checkpoint
1: [rank9]:     self._save_optimizer_and_scheduler(output_dir)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3311, in _save_optimizer_and_scheduler
1: [rank9]:     save_fsdp_optimizer(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/fsdp_utils.py", line 200, in save_fsdp_optimizer
1: [rank9]:     dist_cp.save_state_dict(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 41, in save_state_dict
1: [rank9]:     return _save_state_dict(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 271, in _save_state_dict
1: [rank9]:     central_plan = distW.reduce_scatter("plan", local_step, global_step)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/utils.py", line 167, in reduce_scatter
1: [rank9]:     all_data = self.gather_object(local_data)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/utils.py", line 106, in gather_object
1: [rank9]:     dist.gather_object(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
1: [rank9]:     return func(*args, **kwargs)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2529, in gather_object
1: [rank9]:     gather(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
1: [rank9]:     return func(*args, **kwargs)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3105, in gather
1: [rank9]:     work = default_pg.gather(output_tensors, input_tensors, opts)
1: [rank9]: RuntimeError: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)

How should I configure fsdp_config properly in this case to avoid such errors during saving?
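
For reference, this is roughly the direction I am experimenting with on the Trainer side. It is only a sketch: the specific values are guesses rather than a known fix, I believe the state-dict type (fsdp_state_dict_type) belongs in the accelerate YAML rather than here, and I am not sure whether fsdp/fsdp_config should be duplicated in TrainingArguments at all when launching through accelerate with a config file:

import os

from transformers import TrainingArguments

# As the error message suggests, surface more NCCL detail (set before the
# process group is initialized).
os.environ.setdefault("NCCL_DEBUG", "INFO")

training_args = TrainingArguments(
    output_dir="/shared/checkpoints/run1",  # placeholder path on the shared filesystem
    save_on_each_node=False,                # shared FS: non-sharded files only need to be written once
    fsdp="full_shard auto_wrap",
    fsdp_config={
        # Keys I believe the Trainer accepts; I am unsure which, if any, affect the NCCL failure.
        "backward_prefetch": "backward_pre",
        "forward_prefetch": False,
        "use_orig_params": True,
        "sync_module_states": True,
    },
)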

Thank you!


This seems to be an unresolved issue, but there may be a way to work around it (such as downgrading the accelerate library).
