How to utilize all GPUs when device_map="balanced_low_0" is set

When the model is loaded with the `balanced_low_0` device map, its weights are placed on all GPUs except GPU 0, and GPU 0 is left to do the text inference (i.e., the computation performed inside the LLM to generate a response).

So, as per the given device parameter, my model is loaded onto GPUs 1, 2, and 3, and GPU 0 is left for inference.
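
For reference, a minimal sketch of the loading step described above (the checkpoint name is a placeholder, not my actual model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-llm"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="balanced_low_0",  # split weights across GPUs 1..N, keep GPU 0 nearly empty
    torch_dtype=torch.float16,
)

# hf_device_map shows which device each submodule landed on;
# with balanced_low_0, GPU 0 should hold little or nothing.
print(model.hf_device_map)
```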

| GPU ID | GPU Util | Mem Used |
|--------|----------|----------|
| 0      | 0%       | 3%       |
| 1      | 0%       | 83%      |
| 2      | 0%       | 82%      |
| 3      | 0%       | 76%      |
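
(The snapshot above is from the GPU monitor. A rough in-Python equivalent, for anyone who wants to verify the placement, could look like the sketch below; the percentages are approximate, since `memory_allocated` only counts tensors PyTorch itself tracks.)

```python
import torch

# Print per-GPU memory usage, roughly matching the table above.
for i in range(torch.cuda.device_count()):
    used = torch.cuda.memory_allocated(i)
    total = torch.cuda.get_device_properties(i).total_memory
    print(f"GPU {i}: {used / 1024**3:.1f} / {total / 1024**3:.1f} GiB "
          f"({100 * used / total:.0f}%)")
```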

Question: How can I utilize the remaining GPUs (1, 2, and 3) to perform text inference as well, rather than only GPU 0?

Context (from the docs): `balanced_low_0` evenly splits the model on all GPUs except the first one, and only puts on GPU 0 what does not fit on the others. This option is great when you need to use GPU 0 for some processing of the outputs, like when using the `generate` function for Transformers models.
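
For concreteness, the `generate` pattern the docs are describing looks roughly like this, continuing from the loading sketch above (the prompt and generation arguments are illustrative):

```python
# Inputs are placed on GPU 0, so the logits and the output
# processing inside generate() also end up on GPU 0 -- which is
# why balanced_low_0 keeps that GPU nearly empty of weights.
prompt = "Explain device maps in one sentence."  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(0)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```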

Reference: Handling big models for inference

@joaogante @ybelkada