Getting an error when running inference on multiple GPUs

Hi everyone,
I am trying to run generation on multiple GPUs with the codellama-13b model. Below is my code.

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

model_name = "codellama/CodeLlama-13b-hf"
cache_dir = "/remote/CodeLlama/CACHE/"
device = "cuda:3"

max_memory = {
    0: "16GiB",
    2: "16GiB",
    3: "16GiB"
}

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
config = AutoConfig.from_pretrained(model_name, cache_dir=cache_dir)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
    device_map = infer_auto_device_map(model, max_memory=max_memory)

model = AutoModelForCausalLM.from_pretrained(
    model_name, cache_dir=cache_dir, device_map=device_map) # what should I add here?

prompt = "Explain following Verilog code: \n" + code
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
outputs = model.generate(
    input_ids,
    temperature=1,
    top_k=50,
    top_p=1,
    repetition_penalty=1.1,
    max_new_tokens=500,
    min_new_tokens=3,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    num_return_sequences=20
)
samples = [tokenizer.decode(output) for output in outputs]
print("Completion: \n", samples[0])

When using the multi-GPU device_map:
If I don't call model.to(device), I get torch.cuda.OutOfMemoryError.
If I do call model.to(device), I get the error below.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument index in method wrapper_CUDA__index_select)
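
For context, this is roughly how I have been trying to debug it. It assumes that hf_device_map is populated when the model is loaded with a device_map, and that the embedding module is named model.embed_tokens for this model; both are assumptions on my part, and the "cuda:0" fallback is just a guess.

# Assumes model/tokenizer/prompt are the ones defined above, and that the
# model was loaded with a device_map (so transformers sets hf_device_map).
print(model.hf_device_map)  # shows which device each submodule landed on

# The index_select in the error looks like the embedding lookup, so the
# inputs presumably need to be on the same device as the embedding layer.
embed_device = model.hf_device_map.get("model.embed_tokens", "cuda:0")  # fallback is a guess
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(embed_device)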

I have a few questions.
Do I need to call model.to(device) if I pass a device_map dict?
What happens if I use the device "cuda" instead of "cuda:[0-9]"? Does the model only go to "cuda:0"?
When using multiple GPUs, what should I do with the input tokens: send them to one GPU (as below), or not move them to a GPU at all?

tokenizer(prompt, return_tensors="pt").input_ids.to(device)
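
In case it clarifies what I am asking, below is the variant I was considering, based on my (possibly wrong) reading of the accelerate/transformers docs: pass max_memory with device_map="auto" directly to from_pretrained, never call model.to(device), and move the inputs to the first GPU in the map. The torch_dtype=torch.float16 and the choice of "cuda:0" for the inputs are my own assumptions, not something I have confirmed.

import torch
from transformers import AutoModelForCausalLM

# Assumption: let accelerate infer the placement itself instead of calling
# infer_auto_device_map manually, and skip model.to(device) entirely.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    device_map="auto",          # accelerate decides the per-layer placement
    max_memory=max_memory,      # same per-GPU budget as above
    torch_dtype=torch.float16,  # assumption: fp16 so 13B fits in 3 x 16GiB
)

# Assumption: inputs go to the first GPU in the map instead of "cuda:3".
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")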

Thank you