Hi,
I'm currently trying to set up multi-GPU training with accelerate for GRPO training from the TRL library. Single-GPU training works, but as soon as I switch to multi-GPU, everything fails and I can't figure out why.
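For context, test.py follows the standard GRPO quickstart pattern from the TRL docs, roughly like this (the model, dataset, and reward function below are illustrative placeholders, not my exact script):

# Minimal GRPO setup in the style of the TRL quickstart; details are placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 20 characters.
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()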
The error:
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/mgroepl/hFace/test.py", line 92, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/transformers/trainer.py", line 2241, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/transformers/trainer.py", line 2365, in _inner_training_loop
[rank1]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1389, in prepare
[rank1]:     result = tuple(
[rank1]:              ^^^^^^
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1390, in <genexpr>
[rank1]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1263, in _prepare_one
[rank1]:     return self.prepare_model(obj, device_placement=device_placement)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/accelerate/accelerator.py", line 1522, in prepare_model
[rank1]:     model = torch.nn.parallel.DistributedDataParallel(
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank1]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]:   File "/itet-stor/mgroepl/net_scratch/conda_envs/hFace/lib/python3.12/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank1]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: named symbol not found
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
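The failure happens inside DDP's initial parameter-shape broadcast, before any TRL-specific code runs, so I suspect something in the CUDA/NCCL environment rather than in my training script. If it helps narrow things down, a bare-bones NCCL check along these lines (a standalone sketch, not part of my code) should hit the same error if the environment is the problem:

# nccl_check.py - standalone NCCL sanity check, run with:
#   torchrun --nproc_per_node=2 nccl_check.py
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# A tiny all_reduce: if this also raises "CUDA error: named symbol not found",
# the issue is in the driver/CUDA/NCCL stack, not in TRL or accelerate.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce ok, value = {t.item()}")

dist.destroy_process_group()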
And here is my current batch file for the multi-GPU run:
#!/bin/bash
#SBATCH --output=/home/mgroepl/log/%j.out     # where to store the output (%j is the JOBID), subdirectory "log" must exist
#SBATCH --error=/home/mgroepl/log/%j.out   # where to store error messages
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks-per-node=1         # number of MP tasks
#SBATCH --gres=gpu:2                # number of GPUs per node
#SBATCH --constraint=ampere
# Load Conda (Important for Non-Interactive Shells)
source /itet-stor/mgroepl/net_scratch/conda/etc/profile.d/conda.sh
conda init bash
conda activate hFace
export PYTHONPATH=$PYTHONPATH:/itet-stor/mgroepl/net_scratch/trl
export HF_HOME=/itet-stor/mgroepl/net_scratch/hCache
export GPUS_PER_NODE=2
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
######################

# Send some noteworthy information to the output log
echo "Running on node: $(hostname)"
echo "In directory:    $(pwd)"
echo "Starting on:     $(date)"
echo "SLURM_JOB_ID:    ${SLURM_JOB_ID}"

srun accelerate launch \
    --num_processes 1 \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port 29500 \
    /home/mgroepl/hFace/test.py

echo "Finished at:     $(date)"

# End the script with exit code 0
exit 0
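One thing I'm already unsure about: with --gres=gpu:2, shouldn't --num_processes be 2 (one process per GPU) rather than 1? That is, something like this (a sketch based on the accelerate docs, untested on my cluster):

srun accelerate launch \
    --multi_gpu \
    --num_processes $GPUS_PER_NODE \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port 29500 \
    /home/mgroepl/hFace/test.py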
I'd appreciate any help.