Loading a HF Model on Multiple GPUs and Running Inference on Those GPUs

Is there any way to load a Hugging Face model across multiple GPUs and use those GPUs for inference as well?

For example, here is a model that can be loaded on a single GPU (default cuda:0) and run for inference as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K", torch_dtype=torch.float16
).to("cuda")

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, do_sample=True, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

How should I load and run this model for inference on two or more GPUs using Accelerate or DeepSpeed?

Any guidance or help would be highly appreciated; thanks in advance!
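One approach, assuming `accelerate` is installed alongside `transformers`, is to let `from_pretrained` shard the weights across every visible GPU with `device_map="auto"`. A minimal sketch (model name taken from the snippet above; a large download and multiple CUDA devices are assumed):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
    device_map="auto",  # shard layers over cuda:0, cuda:1, ... automatically
)

# Inputs go to the model's first device; Accelerate's hooks move
# activations between GPUs as the forward pass crosses shard boundaries.
input_ids = tokenizer.encode("Your text here", return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, do_sample=True, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
```

This is model parallelism (one copy of the model split across cards), so it helps when the model does not fit on a single GPU; it does not by itself speed up a single generation.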

Did you find a solution to this problem?
It didn't work for me, but did you try this in the terminal? $ accelerate launch model.py

Hi, this will not work, since I have not used any Accelerate-related code in my script.

Oh okay. I have some code in which Accelerate is used, but I couldn't run it on 2 GPUs. I'm searching for a solution to that.

Can you share it with me? I will run it.

I'm trying to run a demo.py file on 2 GPUs, because when I use 1 GPU it gives an out-of-memory error. But I couldn't get it running on multiple GPUs.

I found a solution and have posted it here.

It'll spin up PyTorch properly to use DDP, so you can prepare the model that way if you want. Otherwise, there's a tutorial on distributed inference with Accelerate here: Distributed Inference with 🤗 Accelerate

Will this solution work for me? Should I change the accelerator to DeepSpeed?

This works for both; you can use accelerate launch model.py instead of the deepspeed command.