Training on multiple GPUs with a multi-file script

I have a training script that takes the training arguments, creates a directory for the experiment run, processes the annotations from the files passed in, and trains a DETR model. My dataset class is custom and inherits from torch.utils.data.Dataset. All of this is driven by a main script that serves as the entry point.
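Roughly, and with placeholder names (make_experiment_dir, DetectionDataset, build_detr and train stand in for my actual code), the entry point looks like this:

```python
# main.py -- simplified sketch of the entry point
# (make_experiment_dir, DetectionDataset, build_detr, train are placeholders)
import argparse

from torch.utils.data import DataLoader, Dataset


class DetectionDataset(Dataset):
    """Custom dataset that parses the annotation files passed on the command line."""

    def __init__(self, annotation_files):
        self.samples = self._load_annotations(annotation_files)

    def _load_annotations(self, annotation_files):
        ...  # read the files into a list of (image_path, target) records

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        ...  # load the image and return (image_tensor, target_dict)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotations", nargs="+", required=True)
    parser.add_argument("--run-name", default="exp")
    args = parser.parse_args()

    run_dir = make_experiment_dir(args.run_name)  # creates a fresh runs/<name>_<n> directory

    dataset = DetectionDataset(args.annotations)
    train_loader = DataLoader(dataset, batch_size=2, shuffle=True)
    model = build_detr()  # placeholder for the DETR model construction

    train(model, train_loader, run_dir)  # plain PyTorch training loop, shown further down


if __name__ == "__main__":
    main()
```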

I moved this setup to accelerate in order to train on multiple GPUs. I followed this tutorial and changed the relevant parts. I started the training with nohup python3 main.py --flags... &. This uses only 1 GPU out of 4, and printing accelerator.num_processes returns 1.
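The accelerate-related changes follow the tutorial; condensed, and with the surrounding code omitted, they amount to roughly this:

```python
from accelerate import Accelerator

accelerator = Accelerator()
print("num_processes:", accelerator.num_processes)  # prints 1 when started via plain python3

# the model, optimizer and dataloader are built exactly as before, then wrapped:
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
device = accelerator.device  # used instead of a hard-coded "cuda"
```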

I then tried running with nohup accelerate launch main.py --flags... & after running accelerate config and setting the appropriate parameters. This created 4 experiment runs/directories, which is not what I want.
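In case it is relevant: the experiment directory is created unconditionally inside main(), roughly like this (simplified, the exact naming logic differs), and my guess is that every launched process executes it and ends up with its own directory:

```python
import os


def make_experiment_dir(run_name, root="runs"):
    # Pick the next free index, e.g. runs/exp_0, runs/exp_1, ...
    # Under accelerate launch, main() -- and therefore this function -- runs
    # once per spawned process, which matches the 4 directories I see on 4 GPUs.
    idx = 0
    while os.path.exists(os.path.join(root, f"{run_name}_{idx}")):
        idx += 1
    run_dir = os.path.join(root, f"{run_name}_{idx}")
    os.makedirs(run_dir)
    return run_dir
```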

My training loop is a plain PyTorch loop; I am not using Trainer or PyTorch Lightning (a trimmed-down version of the loop is shown below). What am I doing wrong? Is there a standard practice I should follow?
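The trimmed-down loop, with the loss computation simplified (the real code uses DETR's set-prediction losses):

```python
# inside train(): criterion/optimizer construction omitted, only the structure is shown
for epoch in range(num_epochs):
    model.train()
    for images, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, targets)
        accelerator.backward(loss)  # was loss.backward() before moving to accelerate
        optimizer.step()
    torch.save(model.state_dict(), os.path.join(run_dir, f"epoch_{epoch}.pt"))
```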