How does compute/resource allocation work for multi-node hypeparameter search?

c-r-p72 · January 23, 2024, 5:27pm

Hi there!

I’ve been using accelerate to train models over multiple GPUs and nodes successfully (i.e., starting the runs using the accelerate command line interface). However, when I try to incorporate the trainer’s in-built hyperparameter search functionality, I get CUDA OOM errors; this is not really surprising, since the hyperparameter search appears to not run a single run per node, but multiple.

Thus my question: how does resource/node/gpu allocation work when running hyperparameter search in a multi-node/multi-gpu setting? And how can I influence this?
In an ideal world, I would be able to specify how many resources to assign per run - like “one whole node” or “2 gpus per node, and 16 threads”.

I am specifically hoping to use accelerate - I know that e.g., ray, can do this, but only after setting up a ray cluster, which I try to avoid.

Thank you!

Topic		Replies	Views
How does compute/resource allocation work for hyperparam search? 🤗Transformers	0	105	March 7, 2024
Multi-node training 🤗Accelerate	2	2936	January 16, 2023
How to use raytune to do distributed hyper-parameter tuning? 🤗Transformers	0	365	February 14, 2022
Accelerate Multi-GPU on several Nodes How to 🤗Accelerate	3	6234	October 13, 2021
Detecting single gpu within each node 🤗Accelerate	2	756	January 17, 2023

How does compute/resource allocation work for multi-node hypeparameter search?

Related topics