Unable to load a FineTuned LLama Model to GPU for inference

bobbybelajar · September 25, 2023, 12:23pm

I’m trying to run inference with my fine-tuned Llama2-7b-hf model. I successfully fine-tuned it and uploaded the model to the Hugging Face servers. Now, for inference, I am not able to load it onto the GPU: it loads into system RAM instead of the CUDA device, even though I already specified the device. It does work on the original machine, however.

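import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name, new_model and device_map are set elsewhere in the script (not shown)
base_model = AutoModelForCausalLM.from_pretrained(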
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

This was the code that I used to merge the fine-tuned weights into the original model. The merge worked, but now I can’t run inference because I cannot load the merged model onto my GPU.

I already tried .to(torch.device("cuda")) and it doesn’t work either.
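
For anyone debugging the same symptom, one way to confirm where the weights actually ended up is to count the model's parameters per device. This is a minimal sketch, not part of the original post: the helper name `parameter_devices` is hypothetical, and it only assumes the object exposes `parameters()` the way a PyTorch module does.

```python
from collections import Counter


def parameter_devices(model):
    """Return a dict mapping device name -> number of parameter tensors on it.

    `model` is assumed to be a torch.nn.Module (or anything with a
    `parameters()` method whose items carry a `.device` attribute).
    """
    return dict(Counter(str(p.device) for p in model.parameters()))
```

If every parameter is reported under `cpu`, the model never reached the GPU. A mix of `cuda:0` and `cpu` suggests the model was only partially offloaded.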

vishalbhardwaj99 · December 14, 2023, 4:18am

Were you able to find a solution to this? I have been struggling with this as well for a long time now.

abbbinav · December 15, 2023, 8:54am

Use Kaggle for testing the fine-tuned model, as it provides 30GB of RAM and 2 T4 GPUs. Also, try running !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 and !pip install transformers==4.34.0. This helped me load the model partially onto the GPU.

bobbybelajar · December 15, 2023, 9:11am

My solution was to shard the merged model before uploading it to Hugging Face. That way, the model can be loaded into memory in smaller chunks.
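
The sharding step described above could look roughly like the sketch below. It is an illustration under assumptions, not the poster's exact code: the function names `save_sharded` and `load_for_inference`, the output directory, and the `"2GB"` shard size are all hypothetical choices. It relies on the `max_shard_size` argument of `save_pretrained` in transformers, and on `device_map="auto"`, which needs the accelerate package installed.

```python
def save_sharded(merged_model, tokenizer, out_dir, max_shard_size="2GB"):
    """Write the merged model as several small checkpoint files.

    `merged_model` is assumed to be the result of `merge_and_unload()`;
    the "2GB" default is an arbitrary example value, not a recommendation.
    """
    merged_model.save_pretrained(
        out_dir,
        max_shard_size=max_shard_size,  # split weights into files of at most this size
        safe_serialization=True,        # write .safetensors shards
    )
    tokenizer.save_pretrained(out_dir)
    return out_dir


def load_for_inference(model_dir):
    """Reload the sharded checkpoint, letting accelerate place it on the GPU."""
    # Imported here so the saving helper above can be used without these installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map="auto",  # requires the accelerate package
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer
```

After saving, the directory (or the Hub repo it is pushed to) contains several shard files plus an index, so loading should need less peak RAM than reading one large checkpoint file at once.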
