Hi everyone,
I am trying to run generation on multiple GPUs with the CodeLlama-13b model. Below is my code.
```python
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-13b-hf"
cache_dir = "/remote/CodeLlama/CACHE/"
device = "cuda:3"

# per-GPU memory budget (GPU 1 is not listed)
max_memory = {
    0: "16GiB",
    2: "16GiB",
    3: "16GiB",
}

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
config = AutoConfig.from_pretrained(model_name, cache_dir=cache_dir)

# build an empty (meta) model just to infer how the layers fit into max_memory
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(model, max_memory=max_memory)

# load the real weights, split across GPUs according to device_map
model = AutoModelForCausalLM.from_pretrained(
    model_name, cache_dir=cache_dir, device_map=device_map
)  # what should I add here?

prompt = "Explain following Verilog code: \n" + code  # `code` holds the Verilog source
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

outputs = model.generate(
    input_ids,
    temperature=1,
    top_k=50,
    top_p=1,
    repetition_penalty=1.1,
    max_new_tokens=500,
    min_new_tokens=3,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    num_return_sequences=20,
)

samples = [tokenizer.decode(output) for output in outputs]
print("Completion: \n", samples[0])
```
When using the multi-GPU `device_map`:

If I don't set `model.to(device)`, it shows `torch.cuda.OutOfMemoryError`.

If I set `model.to(device)`, it shows the error below.

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument index in method wrapper_CUDA__index_select)
```
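My reading of the accelerate docs is that the usual pattern looks roughly like the sketch below: `device_map="auto"`, no `model.to(...)` afterwards, and the inputs sent to the device of the embedding layer. I'm not certain this is correct (the `torch_dtype=torch.float16` line is my own assumption that a 13B model in fp32 won't fit in 16 GiB per card), so please correct me if I'm off:

```python
import torch
from transformers import AutoModelForCausalLM

# sketch only: let accelerate place the layers on its own
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    device_map="auto",          # instead of the hand-built infer_auto_device_map result
    max_memory=max_memory,
    torch_dtype=torch.float16,  # assumption: fp32 weights won't fit in 16 GiB per GPU
)
# note: no model.to(device) here

# send the inputs to wherever the first (embedding) layer ended up
embed_device = model.get_input_embeddings().weight.device
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(embed_device)
```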
I have a few questions.

1. Do I need to set `model.to(device)` if I use a `device_map` dict?
2. What happens when I pass the device as `"cuda"` rather than `"cuda:[0-9]"`? Does the model only go to `"cuda:0"`? (See the small check sketched after this list.)
3. When using multiple GPUs, what should I do with the tokens? Should I send them to one GPU, or not send them to any GPU at all, i.e. is `tokenizer(prompt, return_tensors="pt").input_ids.to(device)` the right call?
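For question 2, my current understanding (which I'd like to confirm) is that a plain `"cuda"` maps to the current default CUDA device; this is the small check I had in mind, using only plain PyTorch:

```python
import torch

# .to("cuda") with no index uses the current default CUDA device,
# which is device 0 unless it has been changed explicitly
x = torch.zeros(1).to("cuda")
print(x.device)                     # e.g. cuda:0
print(torch.cuda.current_device())  # index of the default device
```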
Thank you