Hi there, did you ever find a solution for this? I'm having the same issue here. I've run this code:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# storage_model_path points at the locally downloaded checkpoint
config = AutoConfig.from_pretrained(storage_model_path)
tokenizer = AutoTokenizer.from_pretrained(storage_model_path)
model = AutoModelForCausalLM.from_pretrained(
    storage_model_path,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    local_files_only=True,
)
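For what it's worth, one variant I haven't timed yet is passing low_cpu_mem_usage=True (a real transformers flag, but I haven't confirmed it helps in this setup). Sketching it here in case someone can confirm whether it matters:

# Variant with low_cpu_mem_usage=True -- creates the model on the meta
# device and loads the checkpoint shard-by-shard instead of building a
# full in-memory copy first. Requires accelerate; untested on my end
# for this path. Reuses the imports and names from the block above.
model = AutoModelForCausalLM.from_pretrained(
    storage_model_path,
    config=config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    local_files_only=True,
)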
It still takes about 30 minutes, as opposed to 45 seconds when loading directly from the Hub.
Env requirements:
transformers==4.41.2
torch==2.2.2
requests==2.31.0
accelerate==0.31.0
Using a Databricks 14.3 ML cluster with CUDA 11.8. I'm not sure if it's a storage read-throughput limit or something on the transformers side?
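To test the read-throughput theory, here's a quick timing snippet I put together. The shard filename is a guess; substitute whatever weight file actually exists under storage_model_path:

# Quick-and-dirty read-throughput check for the storage path.
# Reads one weight shard in 64 MiB chunks and reports MB/s.
import os
import time

# Hypothetical shard name -- adjust to a real file in storage_model_path.
shard = os.path.join(storage_model_path, "model-00001-of-00002.safetensors")

chunk_size = 64 * 1024 * 1024
total = 0
start = time.time()
with open(shard, "rb") as f:
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        total += len(data)
elapsed = time.time() - start
print(f"read {total / 1e6:.0f} MB in {elapsed:.1f} s -> {total / 1e6 / elapsed:.1f} MB/s")

If that reports something far below the disk's rated speed, the bottleneck is the storage mount rather than transformers.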
Would appreciate it if anyone has a fix for this.