Trainer API for data parallel on multi-node

Can anyone help me with a boilerplate, or the changes I need to make, if I want to run the Trainer API for data-parallel training on a multi-node setup?

Hey! If you’re looking to run the Trainer API for data parallel on a multi-node setup, here are some key things to check and set up:

  1. Launch with torchrun or DeepSpeed – You’ll need to launch your training script with torchrun (PyTorch) or the DeepSpeed launcher for multi-node training.

  2. Set distributed training parameters – In your TrainingArguments, set ddp_find_unused_parameters=False, and make sure torchrun (which replaces the deprecated torch.distributed.launch) is configured correctly (see the sketch after this list).

  3. Check environment variables – Each node should have correct settings for MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK.

  4. Ensure all nodes communicate – Make sure SSH is set up and all nodes can reach each other. You might also need to configure the NCCL backend (e.g., which network interface it should use).

  5. Modify training script if needed – If you’re not using Hugging Face’s Trainer, ensure your script correctly initializes torch.distributed.init_process_group().
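
To make the checklist concrete, here is a minimal sketch of a Trainer script plus the per-node torchrun launch commands. The model, dataset, GPU counts, and the master address/port are placeholders, not something the Trainer API prescribes:

```python
# train.py: minimal multi-node data-parallel Trainer sketch (placeholder model/dataset).
#
# Launch the SAME command on every node, changing only --node_rank:
#   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
#       --master_addr=10.0.0.1 --master_port=29500 train.py   # node 0
#   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 \
#       --master_addr=10.0.0.1 --master_port=29500 train.py   # node 1
import os

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)


def main():
    # torchrun sets these for every process; printing them is a quick sanity check (item 3).
    print({k: os.environ.get(k) for k in
           ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")})

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    dataset = load_dataset("imdb", split="train[:1%]")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                padding="max_length", max_length=128),
        batched=True,
    )

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        ddp_find_unused_parameters=False,  # item 2
    )

    # When launched under torchrun, Trainer wraps the model in DistributedDataParallel
    # itself, so no explicit init_process_group() call is needed here (item 5).
    Trainer(model=model, args=args, train_dataset=dataset).train()


if __name__ == "__main__":
    main()
```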

If you’re running into issues, sharing your setup details (framework, error messages) would help troubleshoot! :rocket:

Some follow-up questions:

Can I do multi-node training using accelerate instead of DeepSpeed or torchrun?

Also, I’m using a system that allocates node runtime, so I don’t have the master IP in hand when I schedule my job request. Can you suggest what I should do in that scenario to set up my master IP?

Yes. You can use accelerate for multi-node training instead of DeepSpeed or torchrun. Hugging Face’s accelerate library simplifies distributed training and can handle multi-node setups efficiently. You’ll need to configure accelerate properly (via accelerate config or the launch flags) to enable multi-node execution; see the sketch below.
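
A minimal way to sanity-check the topology before running the real job, assuming 2 nodes with 4 GPUs each (the IP, port, and counts below are placeholders):

```python
# check_dist.py: tiny accelerate multi-node sanity check.
#
# Run the SAME command on every node, changing only --machine_rank
# (alternatively, answer the same questions once with `accelerate config`
# and then just run `accelerate launch check_dist.py`):
#   accelerate launch --multi_gpu --num_machines 2 --machine_rank 0 \
#       --main_process_ip 10.0.0.1 --main_process_port 29500 \
#       --num_processes 8 check_dist.py    # node 0
#   accelerate launch --multi_gpu --num_machines 2 --machine_rank 1 \
#       --main_process_ip 10.0.0.1 --main_process_port 29500 \
#       --num_processes 8 check_dist.py    # node 1
#
# --num_processes is assumed here to be the total across nodes (2 x 4 GPUs);
# double-check how your accelerate version counts it.
from accelerate import Accelerator

accelerator = Accelerator()
print(f"process {accelerator.process_index}/{accelerator.num_processes}, "
      f"local rank {accelerator.local_process_index}, "
      f"main process: {accelerator.is_main_process}")
```

The same flags work for launching your actual Trainer script once the check passes.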

For your second question—since you don’t have the master IP beforehand due to dynamic node allocation, you can use one of these approaches:

  1. Use a shared storage system – Some clusters provide a shared filesystem where the first node can write its IP to a file and the others can read it (see the sketch after this list).
  2. Service-based discovery – If your cluster uses a job scheduler like SLURM, you can use scontrol show hostnames to get the node addresses dynamically.
  3. Auto-discovery with environment variables – Some cloud platforms provide a way to fetch the master node dynamically using metadata services.
  4. Manual assignment with retries – If no automated method works, you might need to implement a retry mechanism where worker nodes wait and poll for the master node’s IP before joining the training process.
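
Here is a rough sketch combining options 1 and 2; the NODE_RANK variable, the shared path, and the timeout are assumptions about your environment, not something your scheduler necessarily provides:

```python
# get_master_addr.py: resolve the master node's address when nodes are allocated dynamically.
import os
import socket
import subprocess
import time


def resolve_master_addr(shared_file="/shared/master_addr.txt", timeout=600):
    """Return the master node's address: try SLURM first, then a shared file."""
    # Option 2: on SLURM, take the first host of the job's node list as the master.
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if nodelist:
        hosts = subprocess.check_output(
            ["scontrol", "show", "hostnames", nodelist], text=True).split()
        return hosts[0]

    # Option 1: shared filesystem. The node you treat as rank 0 writes its IP;
    # the other nodes poll until the file shows up (option 4's retry idea).
    if int(os.environ.get("NODE_RANK", "0")) == 0:
        ip = socket.gethostbyname(socket.gethostname())
        with open(shared_file, "w") as f:
            f.write(ip)
        return ip

    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(shared_file):
            with open(shared_file) as f:
                return f.read().strip()
        time.sleep(5)
    raise RuntimeError("Timed out waiting for the master node to publish its IP")


if __name__ == "__main__":
    print(resolve_master_addr())
```

In practice you would call something like this in a small wrapper on each node and pass the result to the launcher as --master_addr (torchrun) or --main_process_ip (accelerate launch).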

If you’re using accelerate, it can sometimes handle discovery automatically—make sure to explore its multi-node configuration options! Let me know if you need specific guidance based on your setup. :rocket:

Thanks for your help.

I was able to figure out the issues at my end.

Firstly, many forum threads said to set num_processes to the total number of processes across both nodes (i.e., all GPUs combined). Because of that, I was not able to spin up both nodes. In my setup, setting num_processes to the number of GPUs per node worked.

Second, my code was saving checkpoints and reloading them later, so I needed to save them on each node using the save_on_each_node argument of TrainingArguments; see the sketch below.
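
For reference, the checkpoint fix boils down to this (the path and save settings below are just placeholders):

```python
from transformers import TrainingArguments

# Without a shared filesystem, only the main node writes checkpoints by default,
# so the other node could not reload them later; save_on_each_node fixes that.
args = TrainingArguments(
    output_dir="checkpoints",   # local path on each node
    save_strategy="steps",
    save_steps=500,
    save_on_each_node=True,     # write checkpoints on every node, not just the main one
)
```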
