Hi @SDryluth,
I am able to load the model this way:
```python
import torch
from transformers import AutoModelForCausalLM, AutoConfig
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

pretrained_model_dir = 'mosaicml/mpt-7b'
pretrained_model_cache_dir = "/home/user/.cache/huggingface/hub/models--mosaicml--mpt-7b/snapshots/d8304854d4877849c3c0a78f3469512a84419e84/"

config = AutoConfig.from_pretrained(pretrained_model_dir, trust_remote_code=True, torch_dtype=torch.float16)

# Instantiate the model on the meta device, so no weight memory is allocated yet
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.float16)

# Cap GPU 0 at 10 GiB and let the remaining weights spill over to CPU RAM
max_memory = {0: "10GiB", "cpu": "80GiB"}
model = load_checkpoint_and_dispatch(
    model, pretrained_model_cache_dir, device_map="auto", max_memory=max_memory, dtype=torch.float16
)
```
I only have a single GPU with 12 GB of VRAM, so I am loading the rest of the model on the CPU, but perhaps you can modify your `max_memory` dict to include your 2nd GPU and see if that works.
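For a two-GPU setup, the dict would look something like this. The limits below are just placeholders (I don't know your card sizes); it's usually safest to leave a couple of GiB of headroom below each GPU's total VRAM for activations:

```python
# Hypothetical two-GPU split: device indices are GPU ordinals, "cpu" is system RAM.
# Adjust the per-device limits to match your actual hardware.
max_memory = {0: "10GiB", 1: "10GiB", "cpu": "30GiB"}
```

`load_checkpoint_and_dispatch` with `device_map="auto"` will then fill GPU 0 first, then GPU 1, and only offload whatever is left to the CPU.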