How does compute/resource allocation work for hyperparam search?

Hi there!

(this is a duplicate of How does compute/resource allocation work for multi-node hyperparameter search?, which hasn’t gotten any responses yet, so I’m raising it here again since it concerns accelerate and transformers equally)

I’ve been using accelerate to train models across multiple GPUs and nodes successfully (i.e., launching the runs via the accelerate command line interface). However, when I try to incorporate the Trainer’s built-in hyperparameter search functionality, I get CUDA OOM errors. This is not really surprising, since the search appears to launch multiple trials per node rather than one.
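
For context, here is a minimal sketch of my setup (model, dataset, and trial counts are placeholders, not my actual configuration), launched on each node with the accelerate CLI:

```python
# Launched on each node with something like:
#   accelerate launch --multi_gpu --num_machines 2 --machine_rank <rank> train.py
# Requires `pip install optuna` for the optuna backend.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL = "bert-base-uncased"  # placeholder; my actual model is larger
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Tiny dummy dataset so the sketch is self-contained
data = Dataset.from_dict({"text": ["good", "bad"] * 32, "label": [1, 0] * 32})
data = data.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)
split = data.train_test_split(test_size=0.25)

def model_init(trial=None):
    # hyperparameter_search requires model_init instead of a fixed model,
    # so every trial starts from fresh weights
    return AutoModelForSequenceClassification.from_pretrained(MODEL)

def hp_space(trial):
    # Optuna-style search space
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(
        output_dir="hp_search_out",
        evaluation_strategy="epoch",
        num_train_epochs=1,
    ),
    train_dataset=split["train"],
    eval_dataset=split["test"],
)

best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="optuna",
    n_trials=10,
    direction="minimize",  # minimize eval loss (the default objective)
)
print(best_trial)
```

With this kind of launch, it looks to me like each process starts working on trials independently, which is where I suspect the OOM comes from.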

Thus my question: how does resource/node/GPU allocation work when running a hyperparameter search in a multi-node/multi-GPU setting, and how can I influence it?
In an ideal world, I would be able to specify how many resources to assign per trial, like “one whole node” or “2 GPUs per node and 16 threads”.

I am specifically hoping to stick with accelerate. I know that Ray Tune, for example, can do this, but only after setting up a Ray cluster, which I would like to avoid; a sketch of what I mean is below.
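
If I understand the ray backend correctly, extra keyword arguments to hyperparameter_search are forwarded to ray.tune.run, so per-trial resources can be pinned roughly like this (sketch only; I have not actually set this up, since it requires the Ray cluster):

```python
from ray import tune

def ray_hp_space(trial):
    # for the ray backend, the search space uses ray.tune sampling primitives
    return {"learning_rate": tune.loguniform(1e-6, 1e-4)}

best_trial = trainer.hyperparameter_search(
    hp_space=ray_hp_space,
    backend="ray",
    n_trials=10,
    # forwarded to ray.tune.run: "2 GPUs and 16 CPU threads per trial",
    # which is exactly the kind of per-trial resource spec I am after
    resources_per_trial={"gpu": 2, "cpu": 16},
)
```

Is there an equivalent way to express this with accelerate alone?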

Thank you!