"Out of memory" when loading quantized model

Hello. I am trying to load a quantized model, but I keep getting “CUDA out of memory” or out-of-RAM errors.

I am working on a Google Colab V100 instance (51 GB RAM + 16 GB VRAM). Here is a code snippet:

import torch
import psutil
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from huggingface_hub import snapshot_download

# Build the 33B model skeleton on the meta device (no weights allocated yet)
config = AutoConfig.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct", trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Download only the config JSONs and the pre-quantized safetensors shards
weights_location = snapshot_download(
    repo_id="whistleroosh/deepseek-coder-33b-instruct-8bit",
    allow_patterns=["*.json", "*.safetensors"],
    ignore_patterns=["*.bin.index.json"]
)

bnb_quantization_config = BnbQuantizationConfig(
    load_in_8bit=True
)

# Use the full device totals (in GiB) as the max_memory budget
vram = torch.cuda.get_device_properties(0).total_memory / (1024**3)
ram = psutil.virtual_memory().total / (1024**3)

# Load the 8-bit shards, dispatching across GPU, CPU RAM and disk offload
model = load_and_quantize_model(
    model,
    weights_location=weights_location,
    device_map="auto",
    bnb_quantization_config=bnb_quantization_config,
    offload_folder="offload",
    offload_state_dict=True,
    no_split_module_classes=model._no_split_modules,
    max_memory={"cpu": f"{ram:.2f}GiB", 0: f"{vram:.2f}GiB"}
)

The repo “whistleroosh/deepseek-coder-33b-instruct-8bit” contains shards that I previously quantized to 8-bit with bitsandbytes and saved with accelerate’s save_model. Unless I decrease the vram value by about 10 GiB and the ram value by around 30 GiB, I keep getting out-of-memory errors.
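
For context, the shards were produced roughly like this (a simplified sketch; fp16_weights_location is a placeholder for the path to the original fp16 checkpoint, and the exact script may have differed slightly):

from accelerate import Accelerator
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model

# Quantize the original fp16 checkpoint to 8-bit while loading...
quantized = load_and_quantize_model(
    model,                                   # empty-weights skeleton as above
    weights_location=fp16_weights_location,  # placeholder: original fp16 shards
    bnb_quantization_config=BnbQuantizationConfig(load_in_8bit=True),
    device_map="auto",
)
# ...then shard and save the quantized weights as safetensors
Accelerator().save_model(quantized, "deepseek-coder-33b-instruct-8bit")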

To my understanding, I should have no memory problems as long as the largest shard, which is around 9 GB, fits. So why do I need to cut the memory budget by more than half just to load the model?

Even after limiting max_memory and getting the model to load, I get “CUDA out of memory” when trying to run inference. Is there something wrong with my setup, or with my understanding of how this works?
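
In case it helps diagnose this, printing the inferred device map shows which modules end up on GPU, CPU or disk (a sketch; I am assuming dtype=torch.int8 is the right size estimate for the 8-bit weights):

from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model,
    max_memory={"cpu": f"{ram:.2f}GiB", 0: f"{vram:.2f}GiB"},
    no_split_module_classes=model._no_split_modules,
    dtype=torch.int8,  # assumption: size the layers as 8-bit weights
)
print(device_map)  # modules mapped to "cpu" or "disk" explain the offloading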

I managed to load the unquantized version of this model with load_checkpoint_and_dispatch. Even inference worked, although it took two hours, so a model as large as 33B can run on my setup.
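
For reference, the unquantized load that worked looked roughly like this (a sketch; fp16_weights_location again stands in for the original fp16 checkpoint path):

from accelerate import load_checkpoint_and_dispatch

model = load_checkpoint_and_dispatch(
    model,                             # empty-weights skeleton as above
    checkpoint=fp16_weights_location,  # placeholder: original fp16 shards
    device_map="auto",
    offload_folder="offload",
    no_split_module_classes=model._no_split_modules,
)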

So I believe there are differences in how load_checkpoint_and_dispatch and load_and_quantize_model load the model. But I can’t load the quantized model with load_checkpoint_and_dispatch. When I try, I get:

“Only Tensors of floating point and complex dtype can require gradients”

Adding dtype=torch.float sometimes helps (it is intermittent: sometimes the load succeeds, sometimes it does not). But when it does load and I run inference, I get:

“probability tensor contains either inf, nan or element < 0”
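
Concretely, the attempt that sometimes loads is roughly this (a sketch, pointing load_checkpoint_and_dispatch at the 8-bit shards downloaded above):

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=weights_location,       # the 8-bit safetensors shards
    device_map="auto",
    offload_folder="offload",
    no_split_module_classes=model._no_split_modules,
    dtype=torch.float,                 # the workaround mentioned above; results are unreliable
)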

How should one correctly load the quantized model?