Unable to load a fine-tuned Llama model to GPU for inference

I’m trying to run inference with my fine-tuned Llama2-7b-hf model. I successfully fine-tuned it and uploaded the model to the Hugging Face Hub, but now, at inference time, I am not able to load it onto the GPU: it gets loaded into system RAM instead of the CUDA device, even though I already specified the device. Inference on the original machine I trained on does work, however.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# model_name, new_model, and device_map are defined earlier in the script

# Load the base model in fp16
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

# Load the fine-tuned PEFT adapter and merge it into the base weights
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

This is the code I used to merge the adapter weights into the original model. The merge worked, but now I can’t do inference since I can’t load the merged model onto my GPU.

I already tried .to(torch.device("cuda")) and it doesn’t work either.
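
For reference, this is roughly what the inference attempt looks like (a minimal sketch; the repo id below is a placeholder for my uploaded merged model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_repo = "my-username/llama2-7b-merged"  # placeholder repo id

# Load the merged model in fp16, then try to move it to the GPU
model = AutoModelForCausalLM.from_pretrained(
    merged_repo,
    torch_dtype=torch.float16,
)
model = model.to(torch.device("cuda"))  # this is the call that fails to put it on the GPU for me

tokenizer = AutoTokenizer.from_pretrained(merged_repo)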

Were you able to find a solution to this? I have been struggling with this as well for a long time now.

Use Kaggle for testing the fine-tuned model, as it provides 30 GB of RAM and two T4 GPUs. Also, try !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 and !pip install transformers==4.34.0; that combination helped me load the model at least partially onto the GPU.
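
For example, with those versions installed, something along these lines lets Accelerate spread the layers across the two T4s and offload whatever does not fit to CPU RAM (a rough sketch; the repo id is a placeholder for your merged model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "my-username/llama2-7b-merged"  # placeholder for your merged model on the Hub

# device_map="auto" lets Accelerate place layers on the two T4s
# and offload anything that does not fit to CPU RAM
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload",  # only used if some layers spill over to disk
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

print(model.hf_device_map)  # shows which device each block ended up on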

My solution was to shard the merged model before uploading it to the Hugging Face Hub. By doing this, the model can be loaded into memory in smaller chunks.
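
For reference, the sharding can be done with max_shard_size when saving or pushing the merged model (a rough sketch; the local folder and repo id are placeholders):

# After merge_and_unload(), save the merged model in small shards
# so each checkpoint file can be loaded piece by piece
model.save_pretrained("llama2-7b-merged", max_shard_size="2GB")
tokenizer.save_pretrained("llama2-7b-merged")

# Or push the sharded checkpoint straight to the Hub
model.push_to_hub("my-username/llama2-7b-merged", max_shard_size="2GB")
tokenizer.push_to_hub("my-username/llama2-7b-merged")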