I’m trying to run inference with my fine-tuned Llama2-7b-hf model. I successfully fine-tuned it and uploaded the model to the Hugging Face Hub, but now, at inference time, I can’t load it onto the GPU: it loads into RAM instead of the CUDA device, even though I already specified the device. On the original machine I trained on, it does work, however.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in fp16, placed according to device_map
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

# Attach the fine-tuned adapter and merge it into the base weights
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
This is the code I used to merge the adapter weights into the original model. The merge worked, but now I can’t run inference because I can’t load the merged model onto my GPU.
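For reference, this is roughly how I saved and uploaded the merged model afterwards, continuing from the snippet above (the local path and Hub repo id are placeholders for my actual names):

# Roughly how I saved and pushed the merged model
# ("merged-llama2-7b" and "my-username/merged-llama2-7b" are placeholders)
model.save_pretrained("merged-llama2-7b")
tokenizer.save_pretrained("merged-llama2-7b")
model.push_to_hub("my-username/merged-llama2-7b")
tokenizer.push_to_hub("my-username/merged-llama2-7b")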
I already tried .to(torch.device("cuda")), but that doesn’t work either.
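For context, this is roughly the inference setup I’m attempting on the new machine (the Hub repo id and the prompt are placeholders for my actual values):

# Rough sketch of the failing inference attempt
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_repo = "my-username/merged-llama2-7b"  # placeholder for my merged-model repo

model = AutoModelForCausalLM.from_pretrained(
    merged_repo,
    torch_dtype=torch.float16,
)
model = model.to(torch.device("cuda"))  # this step doesn't actually move the model to GPU for me

tokenizer = AutoTokenizer.from_pretrained(merged_repo)
inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))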