Hi everyone,
I am trying to run generation on multiple GPUs with the CodeLlama-13b model. Below is my code.
```python
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-13b-hf"
cache_dir = "/remote/CodeLlama/CACHE/"
device = "cuda:3"

# per-GPU memory budget (GPU 1 is not listed)
max_memory = {
    0: "16GiB",
    2: "16GiB",
    3: "16GiB",
}

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
config = AutoConfig.from_pretrained(model_name, cache_dir=cache_dir)

# build an empty (meta) model just to infer how the layers fit into max_memory
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(model, max_memory=max_memory)

# load the real weights, split across GPUs according to device_map
model = AutoModelForCausalLM.from_pretrained(
    model_name, cache_dir=cache_dir, device_map=device_map
)  # what should I add here?

prompt = "Explain following Verilog code: \n" + code  # `code` holds the Verilog source
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

outputs = model.generate(
    input_ids,
    temperature=1,
    top_k=50,
    top_p=1,
    repetition_penalty=1.1,
    max_new_tokens=500,
    min_new_tokens=3,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    num_return_sequences=20,
)

samples = [tokenizer.decode(output) for output in outputs]
print("Completion: \n", samples[0])
```
When using the multi-GPU `device_map`:

If I don't set `model.to(device)`, it shows `torch.cuda.OutOfMemoryError`.

If I set `model.to(device)`, it shows the error below.

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument index in method wrapper_CUDA__index_select)
```
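My reading of the accelerate docs is that the usual pattern looks roughly like the sketch below: `device_map="auto"`, no `model.to(...)` afterwards, and the inputs sent to the device of the embedding layer. I'm not certain this is correct (the `torch_dtype=torch.float16` line is my own assumption that a 13B model in fp32 won't fit in 16 GiB per card), so please correct me if I'm off:

```python
import torch
from transformers import AutoModelForCausalLM

# sketch only: let accelerate place the layers on its own
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    device_map="auto",          # instead of the hand-built infer_auto_device_map result
    max_memory=max_memory,
    torch_dtype=torch.float16,  # assumption: fp32 weights won't fit in 16 GiB per GPU
)
# note: no model.to(device) here

# send the inputs to wherever the first (embedding) layer ended up
embed_device = model.get_input_embeddings().weight.device
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(embed_device)
```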
I have a few questions.

1. Do I need to set `model.to(device)` if I use a `device_map` dict?
2. What happens when I pass the device as `"cuda"` rather than `"cuda:[0-9]"`? Does the model only go to `"cuda:0"`? (See the small check sketched after this list.)
3. When using multiple GPUs, what should I do with the tokens? Should I send them to one GPU, or not send them to any GPU at all, i.e. is `tokenizer(prompt, return_tensors="pt").input_ids.to(device)` the right call?
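For question 2, my current understanding (which I'd like to confirm) is that a plain `"cuda"` maps to the current default CUDA device; this is the small check I had in mind, using only plain PyTorch:

```python
import torch

# .to("cuda") with no index uses the current default CUDA device,
# which is device 0 unless it has been changed explicitly
x = torch.zeros(1).to("cuda")
print(x.device)                     # e.g. cuda:0
print(torch.cuda.current_device())  # index of the default device
```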
Thank you