I’m trying to load llama-13b for inference with load_checkpoint_and_dispatch on a system with 24GB of VRAM and 32GB of system memory. The model should fit in the combined memory I have, but it looks like load_checkpoint_and_dispatch starts by loading the whole model into system memory at full precision before moving anything to the GPU, which makes me run out of system memory. Is there any way around this, or is this just a limitation of the current implementation? The checkpoint is sharded, so it seems like it should be possible to load the shards one at a time, moving each one to the GPU until it is full, and only then start loading the shards meant to stay in system memory.
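Roughly, the loading order I have in mind looks like the sketch below. This is purely illustrative pseudologic on my part, not an existing accelerate API; the 20GiB GPU budget is a placeholder, and I've skipped the step of actually assigning the tensors into the meta-initialized model.

    import json
    import torch
    from huggingface_hub import hf_hub_download

    checkpoint = "decapoda-research/llama-13b-hf"
    index_path = hf_hub_download(checkpoint, "pytorch_model.bin.index.json")

    with open(index_path) as f:
        index = json.load(f)

    gpu_budget = 20 * 1024**3  # hypothetical ~20GiB cap for GPU 0
    gpu_used = 0

    # weight_map maps each parameter name to the shard file that contains it
    for shard_file in sorted(set(index["weight_map"].values())):
        shard_path = hf_hub_download(checkpoint, shard_file)
        state_dict = torch.load(shard_path, map_location="cpu")
        for name, tensor in state_dict.items():
            tensor = tensor.to(torch.float16)
            nbytes = tensor.numel() * tensor.element_size()
            if gpu_used + nbytes <= gpu_budget:
                tensor = tensor.to("cuda:0")  # fill the GPU first
                gpu_used += nbytes
            # else: leave the tensor on CPU
            # ...here the tensor would be assigned into the meta-initialized model
        del state_dict  # only one shard ever sits in system memory at a time

That way only a single shard is resident in system memory at any point, instead of the full fp32 model.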
Here’s my code:
    import torch
    from huggingface_hub import hf_hub_download
    from transformers import LlamaTokenizer, LlamaForCausalLM
    from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

    checkpoint = "decapoda-research/llama-13b-hf"
    # download only the shard index, not the weight files themselves
    model_index_path = hf_hub_download(checkpoint, "pytorch_model.bin.index.json")
    tokenizer = LlamaTokenizer.from_pretrained(checkpoint)

    with init_empty_weights():
        model = LlamaForCausalLM.from_pretrained(checkpoint, low_cpu_mem_usage=True).half()

    # cap GPU 0 at 20GiB and the CPU at 16GiB
    device_map = infer_auto_device_map(
        model,
        max_memory={
            0: "20GiB",
            "cpu": "16GiB",
        },
    )

    model = load_checkpoint_and_dispatch(
        model,
        model_index_path,
        device_map=device_map,
        no_split_module_classes=["LlamaDecoderLayer"],
        dtype=torch.float16,
    )