Training on multiple GPUs with a multi-file script

I have a training script that takes the training arguments, creates a directory for the experiment run, processes the annotations from the files passed in, and trains a DETR model. My dataset class is custom and inherits from torch.utils.data.Dataset. All of this is driven by a main script that serves as the entry point.
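Roughly, and with placeholder names (make_experiment_dir, DetectionDataset, build_detr and train stand in for my actual code), the entry point looks like this:

```python
# main.py -- simplified sketch of the entry point
# (make_experiment_dir, DetectionDataset, build_detr, train are placeholders)
import argparse

from torch.utils.data import DataLoader, Dataset


class DetectionDataset(Dataset):
    """Custom dataset that parses the annotation files passed on the command line."""

    def __init__(self, annotation_files):
        self.samples = self._load_annotations(annotation_files)

    def _load_annotations(self, annotation_files):
        ...  # read the files into a list of (image_path, target) records

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        ...  # load the image and return (image_tensor, target_dict)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotations", nargs="+", required=True)
    parser.add_argument("--run-name", default="exp")
    args = parser.parse_args()

    run_dir = make_experiment_dir(args.run_name)  # creates a fresh runs/<name>_<n> directory

    dataset = DetectionDataset(args.annotations)
    train_loader = DataLoader(dataset, batch_size=2, shuffle=True)
    model = build_detr()  # placeholder for the DETR model construction

    train(model, train_loader, run_dir)  # plain PyTorch training loop, shown further down


if __name__ == "__main__":
    main()
```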

I moved this setup to accelerate in order to train on multiple GPUs. I followed this tutorial and changed the relevant parts. I started the training with nohup python3 main.py --flags... &. This uses only 1 GPU out of 4, and printing accelerator.num_processes returns 1.
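The accelerate-related changes follow the tutorial; condensed, and with the surrounding code omitted, they amount to roughly this:

```python
from accelerate import Accelerator

accelerator = Accelerator()
print("num_processes:", accelerator.num_processes)  # prints 1 when started via plain python3

# the model, optimizer and dataloader are built exactly as before, then wrapped:
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
device = accelerator.device  # used instead of a hard-coded "cuda"
```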

I then tried running with nohup accelerate launch main.py --flags... & after running accelerate config and setting the appropriate parameters. This created 4 experiment runs/directories, which is not what I want.
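In case it is relevant: the experiment directory is created unconditionally inside main(), roughly like this (simplified, the exact naming logic differs), and my guess is that every launched process executes it and ends up with its own directory:

```python
import os


def make_experiment_dir(run_name, root="runs"):
    # Pick the next free index, e.g. runs/exp_0, runs/exp_1, ...
    # Under accelerate launch, main() -- and therefore this function -- runs
    # once per spawned process, which matches the 4 directories I see on 4 GPUs.
    idx = 0
    while os.path.exists(os.path.join(root, f"{run_name}_{idx}")):
        idx += 1
    run_dir = os.path.join(root, f"{run_name}_{idx}")
    os.makedirs(run_dir)
    return run_dir
```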

My training loop is a plain PyTorch loop; I am not using Trainer or PyTorch Lightning (a trimmed-down version of the loop is shown below). What am I doing wrong? Is there a standard practice I should follow?
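The trimmed-down loop, with the loss computation simplified (the real code uses DETR's set-prediction losses):

```python
# inside train(): criterion/optimizer construction omitted, only the structure is shown
for epoch in range(num_epochs):
    model.train()
    for images, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, targets)
        accelerator.backward(loss)  # was loss.backward() before moving to accelerate
        optimizer.step()
    torch.save(model.state_dict(), os.path.join(run_dir, f"epoch_{epoch}.pt"))
```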