Loading a HF Model on Multiple GPUs and Running Inference on Those GPUs

Hi,
Is there any way to load a Hugging Face model across multiple GPUs and use those GPUs for inference as well?

For example, here is a model that can be loaded on a single GPU (default cuda:0) and run for inference as below:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
# Load the weights in fp16 and place the whole model on the first GPU (cuda:0)
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K", torch_dtype=torch.float16
).to("cuda:0")

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, do_sample=True, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

How should I load and run this model for inference on two or more GPUs using Accelerate or DeepSpeed?

Any guidance/help would be highly appreciated. Thanks in advance!
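For illustration, one possible direction (I am not sure whether it is the recommended one) would be to let Accelerate shard the checkpoint across all visible GPUs with device_map="auto", roughly like the sketch below. This needs accelerate installed, and the input tensor is sent to the first GPU, where the embedding layer usually ends up:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
# device_map="auto" asks Accelerate to split the layers across every visible GPU
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Inputs go to the first device; the sharded model moves activations between GPUs itself
input_ids = tokenizer.encode("Your text here", return_tensors="pt").to("cuda:0")
output = model.generate(input_ids, max_length=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))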

Could you find a solution to this problem?
It didn’t work for me, but did you try this in the terminal? $ accelerate launch model.py

Hi, this will not work, since I have not used any Accelerate-related code in the script.

Oh okay. I have some code that does use Accelerate, but I couldn’t run it on 2 GPUs either. I’m searching for a solution to that.

Can you share it with me? I will run it.

I’m trying to run the demo.py file on 2 GPUs, because with 1 GPU it gives an out-of-memory error. But I couldn’t get it to run on multiple GPUs.

I found a solution and have posted it here.

It’ll spin up PyTorch properly to use DDP, so you can prepare the model that way if you want. Otherwise, there’s a tutorial on distributed inference with Accelerate here: Distributed Inference with 🤗 Accelerate
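Roughly, the data-parallel pattern from that tutorial looks like the sketch below; the model name and prompts are just placeholders, so treat it as a sketch based on the guide rather than a tested snippet:

import torch
from accelerate import PartialState
from transformers import AutoTokenizer, AutoModelForCausalLM

# One process per GPU; run with `accelerate launch script.py`
state = PartialState()

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K", torch_dtype=torch.float16
).to(state.device)

prompts = ["Your text here", "Another prompt", "A third prompt"]  # placeholder inputs

# Each process receives its own slice of the prompt list
with state.split_between_processes(prompts) as my_prompts:
    for prompt in my_prompts:
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(state.device)
        output = model.generate(input_ids, max_length=256)
        print(f"[GPU {state.process_index}] {tokenizer.decode(output[0], skip_special_tokens=True)}")

Note that this is data parallelism: each GPU holds a full copy of the model and works through its own share of the prompts. If a single GPU can’t fit the model at all, you’d need the device_map-style sharding mentioned earlier in the thread instead.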

Will this solution work for me? Should I switch from Accelerate to DeepSpeed?

This works for both; you can use accelerate launch model.py instead of the deepspeed command.
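For example, assuming two GPUs on the machine, the launch command would look something like:

$ accelerate launch --num_processes 2 model.py

Alternatively, run accelerate config once to set the number of processes interactively, and then a plain accelerate launch model.py will pick up that configuration.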