Why is model loading for Llama 2 so slow?

It took me about 1 hour to load the llama2-7b-hf model, which seems very strange. What can I do to resolve this issue?
The code is attached as follows:

from transformers import AutoModelForCausalLM
model_dir = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(

Issue solved. It was a disk problem: I copied the model to a “close” disk and the loading time dropped to 7~8 minutes.
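
For anyone who wants to check whether the disk is the bottleneck before copying anything, here is a rough sketch. The function names `read_throughput_mb_s` and `copy_to_local_disk` are made up for illustration, and the paths you pass in are your own. It reads every file in a plain model folder once, reports the average read speed, and optionally copies the folder to a faster disk.

```python
import shutil
import time
from pathlib import Path


def read_throughput_mb_s(model_dir, chunk_size=16 * 1024 * 1024):
    """Read every file under model_dir once and return the average speed in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    for path in Path(model_dir).rglob("*"):
        if path.is_file():
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on tiny folders.
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)


def copy_to_local_disk(src_dir, dst_dir):
    """Copy the model folder to another (hopefully faster) disk and return the new path."""
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
    return dst_dir
```

Note that a second run over the same folder may be served from the OS page cache and look much faster than the disk really is, so the first (cold) run is the one to trust. If the number is far below what you expect from a local SSD, the storage is a likely suspect.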

Can you explain what “close” disk refers to? I am facing a similar issue. I am using an ml.g5.12xlarge instance to run inference with a Llama 2 model. I downloaded the model locally using the snapshot_download method, but model loading still takes more than 30 minutes.
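
In case it helps others with the same setup, this is a minimal sketch of downloading into an explicit folder with `snapshot_download`, so you know exactly which disk the weights end up on. The helper names `target_dir_for` and `download_to_fast_disk` and the folder naming scheme are assumptions for illustration, not part of huggingface_hub.

```python
from pathlib import Path


def target_dir_for(repo_id, base_dir):
    """Map a repo id like 'org/name' to a folder under base_dir (hypothetical layout)."""
    return Path(base_dir) / repo_id.replace("/", "--")


def download_to_fast_disk(repo_id, base_dir):
    """Download a full repo snapshot into a folder on the disk you choose."""
    # Imported lazily so target_dir_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = target_dir_for(repo_id, base_dir)
    # Gated repos such as Llama 2 require that you are already authenticated.
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

You would then pass the returned path to `from_pretrained` instead of the repo id. Whether this speeds things up depends on the disk that `base_dir` lives on.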

Hi @philschmid, I hope you are doing well. I have a question about fine-tuning Llama 2. I created a CSV file in the Alpaca structure, with a text column containing ### instruction, ### input, and ### response. I am not sure which method to use with PEFT and QLoRA, and the many code examples out there are confusing. Could you point me to code that is right for fine-tuning with the Alpaca structure, including saving the model and running inference to test it? In some code I saw tokenizer truncation and padding with labels set to -100, while other code does no preprocessing at all. I appreciate your help. Many thanks.

Hey, did you find a solution? Loading the model used to take 3 minutes at most, and out of nowhere it now takes more than 30 minutes.


I solved this by setting local_files_only to True. (Note: I had previously downloaded LLaMA 2.)
Loading went from around 3.2 hours to around 29 seconds after this change.

model = AutoModelForCausalLM.from_pretrained(
    local_files_only = True
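
A fuller sketch of the same idea, assuming the weights are already in a local folder. The names `find_weight_files` and `load_from_local` are made up for illustration; `local_files_only=True` is the actual `from_pretrained` argument, and as far as I understand it tells the library to use only files already on disk rather than contacting the Hub.

```python
import time
from pathlib import Path


def find_weight_files(model_dir):
    """Return the sorted names of weight files in model_dir (sanity check before loading)."""
    suffixes = {".safetensors", ".bin"}
    return sorted(p.name for p in Path(model_dir).iterdir()
                  if p.is_file() and p.suffix in suffixes)


def load_from_local(model_dir):
    """Load a causal LM strictly from local files and print how long it took."""
    # Imported lazily because importing transformers is itself slow.
    from transformers import AutoModelForCausalLM

    if not find_weight_files(model_dir):
        raise FileNotFoundError(f"No weight files found in {model_dir}")
    start = time.perf_counter()
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
    print(f"Loaded in {time.perf_counter() - start:.1f} s")
    return model
```

Timing the call like this also makes it easier to compare before and after when you change disks or settings.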

@lepotatoguy, can you please share the bnb_config and device_map values in detail?