I’m trying to load llama-13b for inference with load_checkpoint_and_dispatch on a system with 24GB of VRAM and 32GB of system memory. The model should fit in the combined memory I have, but it looks like load_checkpoint_and_dispatch starts by loading the whole model into system memory at full precision before moving anything to the GPU, which makes me run out of system memory. Is there any way around this, or is this just a limitation of the current implementation? The checkpoint is sharded, so it seems like it should be possible to load the shards one at a time, moving each one to the GPU until it is full, and only then start loading the shards meant to stay in system memory.
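Roughly, the loading order I have in mind looks like the sketch below. This is purely illustrative pseudologic on my part, not an existing accelerate API; the 20GiB GPU budget is a placeholder, and I've skipped the step of actually assigning the tensors into the meta-initialized model.

    import json
    import torch
    from huggingface_hub import hf_hub_download

    checkpoint = "decapoda-research/llama-13b-hf"
    index_path = hf_hub_download(checkpoint, "pytorch_model.bin.index.json")

    with open(index_path) as f:
        index = json.load(f)

    gpu_budget = 20 * 1024**3  # hypothetical ~20GiB cap for GPU 0
    gpu_used = 0

    # weight_map maps each parameter name to the shard file that contains it
    for shard_file in sorted(set(index["weight_map"].values())):
        shard_path = hf_hub_download(checkpoint, shard_file)
        state_dict = torch.load(shard_path, map_location="cpu")
        for name, tensor in state_dict.items():
            tensor = tensor.to(torch.float16)
            nbytes = tensor.numel() * tensor.element_size()
            if gpu_used + nbytes <= gpu_budget:
                tensor = tensor.to("cuda:0")  # fill the GPU first
                gpu_used += nbytes
            # else: leave the tensor on CPU
            # ...here the tensor would be assigned into the meta-initialized model
        del state_dict  # only one shard ever sits in system memory at a time

That way only a single shard is resident in system memory at any point, instead of the full fp32 model.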
Here’s my code:
    import torch
    from huggingface_hub import hf_hub_download
    from transformers import LlamaTokenizer, LlamaForCausalLM
    from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

    checkpoint = "decapoda-research/llama-13b-hf"
    # download only the shard index, not the weight files themselves
    model_index_path = hf_hub_download(checkpoint, "pytorch_model.bin.index.json")
    tokenizer = LlamaTokenizer.from_pretrained(checkpoint)

    with init_empty_weights():
        model = LlamaForCausalLM.from_pretrained(checkpoint, low_cpu_mem_usage=True).half()

    # cap GPU 0 at 20GiB and the CPU at 16GiB
    device_map = infer_auto_device_map(
        model,
        max_memory={
            0: "20GiB",
            "cpu": "16GiB",
        },
    )

    model = load_checkpoint_and_dispatch(
        model,
        model_index_path,
        device_map=device_map,
        no_split_module_classes=["LlamaDecoderLayer"],
        dtype=torch.float16,
    )