Can anyone help me with a boilerplate, or the changes I need to make, if I want to run the Trainer API for data-parallel training on a multi-node setup?
Hey! If you’re looking to run the Trainer API for data parallel on a multi-node setup, here are some key things to check and set up:
- Enable `torchrun` or `deepspeed` – You’ll need to launch your training script using `torchrun` (PyTorch) or the DeepSpeed launcher for multi-node training.
- Set distributed training parameters – In your training arguments, set `ddp_find_unused_parameters=False` and make sure `torch.distributed.launch` or `torchrun` is configured correctly.
- Check environment variables – Each node should have correct settings for `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`.
- Ensure all nodes can communicate – Make sure SSH is set up and all nodes can reach each other. You might also need to configure the NCCL backend settings properly.
- Modify your training script if needed – If you’re not using Hugging Face’s `Trainer`, ensure your script correctly initializes `torch.distributed.init_process_group()` (a minimal sketch follows this list).
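Putting the `Trainer`-side pieces together, here’s a minimal sketch. The model name, dataset variable, and launch values are placeholders for your own setup, not a fixed recipe:

```python
# Example per-node launch with torchrun (values are placeholders):
#   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=<0 or 1> \
#       --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def main():
    # Any Hugging Face model works here; this one is just an example.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,     # per-GPU batch; effective batch = this * WORLD_SIZE
        ddp_find_unused_parameters=False,  # as recommended above
    )

    # Trainer picks up RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT set by torchrun
    # and initializes torch.distributed itself, so no manual init_process_group() call.
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=my_train_dataset,    # placeholder: your tokenized training dataset
    )
    trainer.train()

if __name__ == "__main__":
    main()
```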
If you’re running into issues, sharing your setup details (framework, error messages) would help troubleshoot!
Some follow-up questions:
Can I do multi-node training using Accelerate instead of DeepSpeed or torchrun?
Also, I’m using a system that allocates nodes at run time, so I don’t have the master IP in hand when I schedule my job request. Can you suggest what I should do in that scenario to set up my master IP?
Yes, you can use `accelerate` for multi-node training instead of DeepSpeed or `torchrun`. Hugging Face’s `accelerate` library simplifies distributed training and can handle multi-node setups efficiently. You’ll need to configure your `accelerate config` settings properly to enable multi-node execution.
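For reference, here is a hedged sketch of what a per-node launch and a quick connectivity check might look like. The `accelerate launch` flags in the comments are standard options, but the machine count, ranks, and port are placeholders for your cluster:

```python
# Hypothetical per-node launch (adjust values to your cluster):
#   accelerate launch --multi_gpu --num_machines 2 --machine_rank <this node's rank> \
#       --main_process_ip <master ip> --main_process_port 29500 \
#       --num_processes <N> check_launch.py
# (See the follow-up at the end of this thread on whether <N> should be per-node or total.)
import os
import socket

import torch
import torch.distributed as dist

def main():
    # accelerate launch (like torchrun) exports these variables for every process.
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    print(f"host={socket.gethostname()} rank={rank} local_rank={local_rank} world_size={world_size}")

    # Tiny all-reduce to confirm NCCL connectivity across nodes (requires GPUs).
    if world_size > 1:
        dist.init_process_group(backend="nccl")   # reads MASTER_ADDR/MASTER_PORT from the env
        torch.cuda.set_device(local_rank)
        t = torch.ones(1, device="cuda")
        dist.all_reduce(t)                        # sums across all processes
        print(f"rank {rank}: all_reduce result = {t.item()} (expected {world_size})")
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```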
For your second question—since you don’t have the master IP beforehand due to dynamic node allocation, you can use one of these approaches:
- Use a shared storage system – Some clusters provide a shared filesystem where the first node can write its IP to a file, and the others can read it.
- Service-based discovery – If your cluster uses a job scheduler like SLURM, you can use `scontrol show hostname` to get node addresses dynamically (see the sketch after this list).
- Auto-discovery with environment variables – Some cloud platforms provide a way to fetch the master node dynamically using metadata services.
- Manual assignment with retries – If no automated method works, you might need to implement a retry mechanism where worker nodes wait and poll for the master node’s IP before joining the training process.
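For the SLURM case specifically, here is a minimal sketch of resolving the master address from the job’s node list. It assumes the scheduler sets the standard `SLURM_JOB_NODELIST` variable and that `scontrol` is on the PATH; the port is just an example:

```python
import os
import subprocess

def get_master_addr() -> str:
    """Resolve the first host in the SLURM allocation to use as MASTER_ADDR."""
    nodelist = os.environ["SLURM_JOB_NODELIST"]          # e.g. "node[012-013]"
    # Expand the compressed nodelist into one hostname per line and take the first.
    hostnames = subprocess.check_output(
        ["scontrol", "show", "hostnames", nodelist], text=True
    ).split()
    return hostnames[0]

if __name__ == "__main__":
    # Print the resolved address so a job script can capture it, e.g.
    #   MASTER_ADDR=$(python get_master_addr.py)
    # and pass it (plus a free port such as 29500) to torchrun / accelerate launch.
    print(get_master_addr())
```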
If you’re using `accelerate`, it can sometimes handle discovery automatically, so make sure to explore its multi-node configuration options! Let me know if you need specific guidance based on your setup.
Thanks for your help.
I was able to figure out the issues on my end.
First, many of the discussion forums said to set `num_processes` to the total number of processes, i.e. all GPUs across both nodes collectively. Because of that I was not able to spin up both nodes. In my setup, setting `num_processes` to the number of GPUs per node worked.
Second, my code was saving checkpoints and then using them later, so I needed to save them on both nodes using the `save_on_each_node` argument of `TrainingArguments`.
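For anyone hitting the same checkpoint issue, a minimal sketch of that second fix (the output path and save interval here are just examples):

```python
from transformers import TrainingArguments

# save_on_each_node makes every node write its own copy of each checkpoint,
# which matters when the nodes don't share a filesystem and the script later
# reloads the checkpoint locally.
args = TrainingArguments(
    output_dir="checkpoints",   # example path; use node-local or shared storage as appropriate
    save_strategy="steps",
    save_steps=500,
    save_on_each_node=True,
)
```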