"Out of memory" when loading quantized model

Hello. I am trying to load a quantized model, but I keep getting “CUDA out of memory” or out-of-RAM errors.

I am working on a Google Colab V100 instance (51 GB RAM + 16 GB VRAM). Here is a code snippet:

import torch
import psutil
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from huggingface_hub import snapshot_download

# Build the 33B model skeleton on the meta device (no weights allocated yet)
config = AutoConfig.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct", trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Download only the config JSONs and the pre-quantized safetensors shards
weights_location = snapshot_download(
    repo_id="whistleroosh/deepseek-coder-33b-instruct-8bit",
    allow_patterns=["*.json", "*.safetensors"],
    ignore_patterns=["*.bin.index.json"]
)

bnb_quantization_config = BnbQuantizationConfig(
    load_in_8bit=True
)

# Use the full device totals (in GiB) as the max_memory budget
vram = torch.cuda.get_device_properties(0).total_memory / (1024**3)
ram = psutil.virtual_memory().total / (1024**3)

# Load the 8-bit shards, dispatching across GPU, CPU RAM and disk offload
model = load_and_quantize_model(
    model,
    weights_location=weights_location,
    device_map="auto",
    bnb_quantization_config=bnb_quantization_config,
    offload_folder="offload",
    offload_state_dict=True,
    no_split_module_classes=model._no_split_modules,
    max_memory={"cpu": f"{ram:.2f}GiB", 0: f"{vram:.2f}GiB"}
)

The repo “whistleroosh/deepseek-coder-33b-instruct-8bit” contains shards that I previously quantized to 8-bit with bitsandbytes and saved with accelerate’s save_model. Unless I decrease the vram value by about 10 GiB and the ram value by around 30 GiB, I keep getting out-of-memory errors.
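
For context, the shards were produced roughly like this (a simplified sketch; fp16_weights_location is a placeholder for the path to the original fp16 checkpoint, and the exact script may have differed slightly):

from accelerate import Accelerator
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model

# Quantize the original fp16 checkpoint to 8-bit while loading...
quantized = load_and_quantize_model(
    model,                                   # empty-weights skeleton as above
    weights_location=fp16_weights_location,  # placeholder: original fp16 shards
    bnb_quantization_config=BnbQuantizationConfig(load_in_8bit=True),
    device_map="auto",
)
# ...then shard and save the quantized weights as safetensors
Accelerator().save_model(quantized, "deepseek-coder-33b-instruct-8bit")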

To my understanding, I should have no memory problems as long as the largest shard, which is around 9 GB, fits. So why do I need to cut the memory budget by more than half just to load the model?

Even after limiting max_memory and getting the model to load, I get “CUDA out of memory” when trying to run inference. Is there something wrong with my setup, or with my understanding of how this works?
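
In case it helps diagnose this, printing the inferred device map shows which modules end up on GPU, CPU or disk (a sketch; I am assuming dtype=torch.int8 is the right size estimate for the 8-bit weights):

from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model,
    max_memory={"cpu": f"{ram:.2f}GiB", 0: f"{vram:.2f}GiB"},
    no_split_module_classes=model._no_split_modules,
    dtype=torch.int8,  # assumption: size the layers as 8-bit weights
)
print(device_map)  # modules mapped to "cpu" or "disk" explain the offloading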

I managed to load the unquantized version of this model with load_checkpoint_and_dispatch. Even inference worked, although it took two hours, so a model as large as 33B can run on my setup.
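
For reference, the unquantized load that worked looked roughly like this (a sketch; fp16_weights_location again stands in for the original fp16 checkpoint path):

from accelerate import load_checkpoint_and_dispatch

model = load_checkpoint_and_dispatch(
    model,                             # empty-weights skeleton as above
    checkpoint=fp16_weights_location,  # placeholder: original fp16 shards
    device_map="auto",
    offload_folder="offload",
    no_split_module_classes=model._no_split_modules,
)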

So I believe there are differences in how load_checkpoint_and_dispatch and load_and_quantize_model load the model. But I can’t load the quantized model with load_checkpoint_and_dispatch. When I try, I get:

“Only Tensors of floating point and complex dtype can require gradients”

Adding dtype=torch.float sometimes helps (it is intermittent: sometimes the load succeeds, sometimes it does not). But when it does load and I run inference, I get:

“probability tensor contains either inf, nan or element < 0”
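
Concretely, the attempt that sometimes loads is roughly this (a sketch, pointing load_checkpoint_and_dispatch at the 8-bit shards downloaded above):

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=weights_location,       # the 8-bit safetensors shards
    device_map="auto",
    offload_folder="offload",
    no_split_module_classes=model._no_split_modules,
    dtype=torch.float,                 # the workaround mentioned above; results are unreliable
)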

How should one correctly load the quantized model?