Loading a locally saved model is very slow

Hello all, I am loading a Llama 2 model from the HF Hub on a g5.48xlarge SageMaker notebook instance using the commands below, and it takes less than 5 minutes to complete the loading.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# 8-bit quantized load (bitsandbytes), fp16 for the non-quantized modules
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, torch_dtype=torch.float16)

Thereafter I save the model locally using the code below:

save_directory = "models/Llama-2-7b-chat-hf"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
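
For reference, a quick way to see what actually got written to disk (file names and shard sizes) is something like the sketch below, using the same save_directory as above:

import os

# list every file save_pretrained produced, with its size in MB
for name in sorted(os.listdir(save_directory)):
    path = os.path.join(save_directory, name)
    size_mb = os.path.getsize(path) / 1e6
    print(f"{name}: {size_mb:.1f} MB")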

Then I try loading the locally saved model using the code below:

model = AutoModelForCausalLM.from_pretrained(save_directory, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(save_directory)

But it takes around 30 minutes to load the same model from local storage, compared to less than 5 minutes from the HF Hub. What could be the reason behind this, and how can I load the model faster?
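
To narrow down where the time goes, I'm thinking of timing the raw disk reads separately from the from_pretrained call, roughly like this (just a sketch; the *.bin / *.safetensors shard patterns are an assumption about what save_pretrained wrote in my case):

import glob, os, time
from transformers import AutoModelForCausalLM

# time raw reads of the weight shards (pure disk throughput)
shards = glob.glob(os.path.join(save_directory, "*.bin")) + \
         glob.glob(os.path.join(save_directory, "*.safetensors"))
start = time.perf_counter()
total = 0
for shard in shards:
    with open(shard, "rb") as f:
        while chunk := f.read(64 * 1024 * 1024):  # 64 MB chunks
            total += len(chunk)
print(f"raw read: {total / 1e9:.1f} GB in {time.perf_counter() - start:.1f} s")

# time the actual model load for comparison
# note: the raw read above warms the OS file cache, so for a fair comparison
# run the two timings in separate sessions
start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(save_directory, device_map="auto")
print(f"from_pretrained: {time.perf_counter() - start:.1f} s")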

Any answer would be appreciated, Thanks.

Hi there, did you ever find a solution for this? I'm having the same issue here. I have run this code:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained(storage_model_path)

tokenizer = AutoTokenizer.from_pretrained(storage_model_path)

model = AutoModelForCausalLM.from_pretrained(
    storage_model_path,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    local_files_only=True,
)

And it still takes 30 mins as opposed to 45 seconds when loading from the hub directly.

Env requirements:
transformers==4.41.2
torch==2.2.2
requests==2.31.0
accelerate==0.31.0

Using a Databricks 14.3 ML cluster with CUDA version 11.8. Not sure if it's a read-throughput setting on the storage side or something on the transformers side?
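
One thing I'm planning to try, to rule out the storage mount itself, is copying the model folder to node-local disk first and loading from there. Rough sketch (the /local_disk0 destination is just an assumption about where node-local storage is mounted; adjust for your cluster):

import shutil
import torch
from transformers import AutoModelForCausalLM

# copy from the (possibly remote / FUSE-mounted) storage path to node-local disk
local_copy = "/local_disk0/Llama-2-7b-chat-hf"  # assumed node-local path
shutil.copytree(storage_model_path, local_copy, dirs_exist_ok=True)

# then load from the local copy and compare the load time
model = AutoModelForCausalLM.from_pretrained(
    local_copy,
    torch_dtype=torch.bfloat16,
    local_files_only=True,
)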

Would appreciate it if anyone has a fix for this.