I’ve been using accelerate to train models across multiple GPUs and nodes successfully (i.e., starting the runs via the accelerate command line interface). However, when I try to incorporate the Trainer’s built-in hyperparameter search functionality, I get CUDA OOM errors. This is not really surprising, since the hyperparameter search appears to launch multiple trials per node rather than a single one.
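For reference, here is a minimal sketch of my setup; the model name, datasets, and hyperparameters are placeholders, but the structure matches what I run:

```python
# train.py -- minimal sketch; launched with: accelerate launch train.py
# (multi-GPU / multi-node settings come from `accelerate config`)
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

def model_init():
    # hyperparameter_search needs model_init so every trial starts from a fresh model
    return AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=train_dataset,  # placeholder, prepared elsewhere
    eval_dataset=eval_dataset,    # placeholder, prepared elsewhere
)

# plain trainer.train() works fine under accelerate launch;
# this call is what runs me out of GPU memory
best_trial = trainer.hyperparameter_search(
    direction="minimize",
    backend="optuna",
    n_trials=10,
)
```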
Hence my question: how does resource/node/GPU allocation work when running a hyperparameter search in a multi-node, multi-GPU setting? And how can I influence it?
In an ideal world, I would be able to specify how many resources to assign per run, e.g. “one whole node” or “2 GPUs per node and 16 threads”.
I am specifically hoping to stay with accelerate. I know that Ray, for example, can do this (as sketched below), but only after setting up a Ray cluster, which I would like to avoid.
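To make the wish concrete: if I understand the Ray integration correctly, the ray backend of `hyperparameter_search` forwards extra keyword arguments to Ray Tune, so per-trial resources can be pinned like this; it is exactly this kind of knob I am looking for on the accelerate side:

```python
# what I'd like to express, shown with the ray backend for illustration
# (requires `pip install "ray[tune]"` and, for multi-node, a running Ray cluster)
best_trial = trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=10,
    # pin each trial to 2 GPUs and 16 CPU threads; forwarded to Ray Tune
    resources_per_trial={"gpu": 2, "cpu": 16},
)
```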