How does compute/resource allocation work for hyperparam search?

Hi there!

(this is a duplicate of How does compute/resource allocation work for multi-node hyperparameter search?, which hasn’t gotten any responses yet, so I’m raising it here again since it concerns accelerate and transformers equally)

I’ve been using accelerate to train models across multiple GPUs and nodes successfully (i.e., launching the runs via the accelerate command line interface). However, when I try to incorporate the Trainer’s built-in hyperparameter search functionality, I get CUDA OOM errors. This is not really surprising, since the search appears to launch multiple trials per node rather than one.
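
For context, here is a minimal sketch of my setup (model, dataset, and trial counts are placeholders, not my actual configuration), launched on each node with the accelerate CLI:

```python
# Launched on each node with something like:
#   accelerate launch --multi_gpu --num_machines 2 --machine_rank <rank> train.py
# Requires `pip install optuna` for the optuna backend.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL = "bert-base-uncased"  # placeholder; my actual model is larger
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Tiny dummy dataset so the sketch is self-contained
data = Dataset.from_dict({"text": ["good", "bad"] * 32, "label": [1, 0] * 32})
data = data.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)
split = data.train_test_split(test_size=0.25)

def model_init(trial=None):
    # hyperparameter_search requires model_init instead of a fixed model,
    # so every trial starts from fresh weights
    return AutoModelForSequenceClassification.from_pretrained(MODEL)

def hp_space(trial):
    # Optuna-style search space
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(
        output_dir="hp_search_out",
        evaluation_strategy="epoch",
        num_train_epochs=1,
    ),
    train_dataset=split["train"],
    eval_dataset=split["test"],
)

best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="optuna",
    n_trials=10,
    direction="minimize",  # minimize eval loss (the default objective)
)
print(best_trial)
```

With this kind of launch, it looks to me like each process starts working on trials independently, which is where I suspect the OOM comes from.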

Thus my question: how does resource/node/GPU allocation work when running a hyperparameter search in a multi-node/multi-GPU setting, and how can I influence it?
In an ideal world, I would be able to specify how many resources to assign per trial, like “one whole node” or “2 GPUs per node and 16 threads”.

I am specifically hoping to stick with accelerate. I know that Ray Tune, for example, can do this, but only after setting up a Ray cluster, which I would like to avoid; a sketch of what I mean is below.
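
If I understand the ray backend correctly, extra keyword arguments to hyperparameter_search are forwarded to ray.tune.run, so per-trial resources can be pinned roughly like this (sketch only; I have not actually set this up, since it requires the Ray cluster):

```python
from ray import tune

def ray_hp_space(trial):
    # for the ray backend, the search space uses ray.tune sampling primitives
    return {"learning_rate": tune.loguniform(1e-6, 1e-4)}

best_trial = trainer.hyperparameter_search(
    hp_space=ray_hp_space,
    backend="ray",
    n_trials=10,
    # forwarded to ray.tune.run: "2 GPUs and 16 CPU threads per trial",
    # which is exactly the kind of per-trial resource spec I am after
    resources_per_trial={"gpu": 2, "cpu": 16},
)
```

Is there an equivalent way to express this with accelerate alone?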

Thank you!