Loading a locally saved model is very slow

Hi there, did you ever find a solution for this? I'm having the same issue. I've run this code:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained(storage_model_path)
tokenizer = AutoTokenizer.from_pretrained(storage_model_path)
model = AutoModelForCausalLM.from_pretrained(
    storage_model_path,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    local_files_only=True,
)

It still takes ~30 minutes, as opposed to about 45 seconds when loading from the Hub directly.

Env requirements:
transformers==4.41.2
torch==2.2.2
requests==2.31.0
accelerate==0.31.0

I'm using a Databricks 14.3 ML cluster with CUDA 11.8. I'm not sure whether this is a storage read-throughput issue or something on the transformers side.
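One way to narrow this down is to measure raw read throughput from the model directory itself, independent of transformers. Below is a minimal stdlib sketch (the helper name `measure_read_throughput` is hypothetical, and it assumes `storage_model_path` points at the local model directory): if this also crawls along at a few MiB/s, the bottleneck is the storage mount, not the library.

```python
import os
import time

def measure_read_throughput(path, chunk_size=16 * 1024 * 1024):
    """Read every file under `path` in large chunks and report aggregate
    throughput, to check whether slow storage is the bottleneck."""
    total_bytes = 0
    start = time.perf_counter()
    for root, _, files in os.walk(path):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                # Read in big sequential chunks, similar to a model load.
                while chunk := f.read(chunk_size):
                    total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    mib = total_bytes / (1024 * 1024)
    print(f"Read {mib:.1f} MiB in {elapsed:.2f}s "
          f"({mib / max(elapsed, 1e-9):.1f} MiB/s)")
    return total_bytes, elapsed
```

If the raw read speed looks fine but `from_pretrained` is still slow, the cost is likely on the deserialization side instead (e.g. `.bin` pickle checkpoints versus safetensors).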

I'd appreciate it if anyone has a fix for this.