Hi,
I'm currently trying to set up multi-GPU training with Accelerate for GRPO training from the TRL library. Single-GPU training works, but as soon as I move to multi-GPU, everything fails and I can't figure out why.
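For context, test.py is essentially the standard TRL GRPO example, roughly like this (a simplified sketch; the dataset and reward function here are toy stand-ins for my actual ones):

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy dataset, standing in for my real setup
dataset = load_dataset("trl-lib/tldr", split="train")

# Dummy reward: prefer completions close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()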
The error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/mgroepl/hFace/test.py", line 92, in <module>
[rank1]: trainer.train()
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/transformers/trainer.py", line 2241, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/transformers/trainer.py", line 2365, in _inner_training_loop
[rank1]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1389, in prepare
[rank1]: result = tuple(
[rank1]: ^^^^^^
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1390, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1263, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1522, in prepare_model
[rank1]: model = torch.nn.parallel.DistributedDataParallel(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank1]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]: File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank1]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: named symbol not found
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
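In case it helps narrow things down: the failure happens inside DDP setup (accelerator.prepare), before any TRL-specific code runs, so a bare NCCL all-reduce launched the same way should show whether it's the CUDA/NCCL stack rather than the trainer. A minimal sketch (ddp_check.py is just a hypothetical standalone script):

import os
import torch
import torch.distributed as dist

# accelerate launch / torchrun set RANK, WORLD_SIZE and LOCAL_RANK
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# One all-reduce across ranks; if this dies with the same
# "named symbol not found" error, the problem is in the
# CUDA/NCCL/driver stack, not in TRL or accelerate
t = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")
dist.destroy_process_group()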
And my current batch file for the multi-GPU run:
#!/bin/bash
#SBATCH --output=/home/mgroepl/log/%j.out # where to store the output (%j is the JOBID), subdirectory "log" must exist
#SBATCH --error=/home/mgroepl/log/%j.out # where to store error messages
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # number of MP tasks
#SBATCH --gres=gpu:2
#SBATCH --constraint=ampere
# Load Conda (Important for Non-Interactive Shells)
source /itet-stor/mgroepl/net_scratch/conda/etc/profile.d/conda.sh
conda init bash
conda activate hFace
export PYTHONPATH=$PYTHONPATH:/itet-stor/mgroepl/net_scratch/trl
export HF_HOME=/itet-stor/mgroepl/net_scratch/hCache
export GPUS_PER_NODE=2
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# Send some noteworthy information to the output log before launching
echo "Running on node: $(hostname)"
echo "In directory: $(pwd)"
echo "Starting on: $(date)"
echo "SLURM_JOB_ID: ${SLURM_JOB_ID}"
######################
srun accelerate launch \
    --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port 29500 \
    /home/mgroepl/hFace/test.py
echo "Finished at: $(date)"
# End the script with exit code 0
exit 0
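Since "CUDA error: named symbol not found" can indicate a version mismatch somewhere in the torch/CUDA/NCCL stack, I can also post the relevant versions; this is the snippet I'd use to collect them (minimal sketch):

import torch

# Report the versions relevant for NCCL/DDP debugging
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("NCCL:", torch.cuda.nccl.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))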
I'd appreciate any help!