How to generate with a single GPU when a model is loaded onto multiple GPUs?


I am currently using the Llama 2 7B Chat model. I am trying to run inference on inputs with a very high token count, so my thought was to distribute the model across multiple GPUs and run generation on only one of them.

Having read the documentation on handling big models, I tried doing this with AutoModelForCausalLM.from_pretrained(model_id, device_map='balanced_low_0').

I tokenize using
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokens = tokenizer.encode(prompt, return_tensors='pt').to('cuda:0')
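For reference, the full snippet I'm running looks roughly like this (model_id and prompt are placeholders; I can't share the exact values here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder

# Shard the model across GPUs; 'balanced_low_0' keeps GPU 0 mostly free,
# which I hoped would leave room for generation there
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="balanced_low_0",
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "..."  # a very long prompt in practice
# return_tensors="pt" gives a tensor that can be moved to a device
tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda:0")

# This is the call that raises the RuntimeError below
output = model.generate(tokens, max_new_tokens=256)
```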

However, when I try model.generate(tokens, ...) I get the error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Can anyone help me run model.generate on only one GPU ('cuda:0') while storing the model on the rest of the GPUs?

Hardware: 2x RTX 6000 GPUs.