Hello. I am trying to load a quantized model, but I keep getting "CUDA out of memory" or out-of-RAM errors.
I am working on a Google Colab V100 runtime (51 GB RAM + 16 GB VRAM). Here is a code snippet:
import torch
import psutil
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from huggingface_hub import snapshot_download

# Build the model skeleton on the meta device (no weights allocated yet)
config = AutoConfig.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct", trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Download only the config and safetensors shards of the pre-quantized checkpoint
weights_location = snapshot_download(
    repo_id="whistleroosh/deepseek-coder-33b-instruct-8bit",
    allow_patterns=["*.json", "*.safetensors"],
    ignore_patterns=["*.bin.index.json"]
)

bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True)

# Pass all available RAM/VRAM as the memory budget
vram = torch.cuda.get_device_properties(0).total_memory / (1024**3)
ram = psutil.virtual_memory().total / (1024**3)

model = load_and_quantize_model(
    model,
    weights_location=weights_location,
    device_map="auto",
    bnb_quantization_config=bnb_quantization_config,
    offload_folder="offload",
    offload_state_dict=True,
    no_split_module_classes=model._no_split_modules,
    max_memory={"cpu": f"{ram:.2f}GiB", 0: f"{vram:.2f}GiB"}
)
In the repo "whistleroosh/deepseek-coder-33b-instruct-8bit" I have shards that were previously quantized to 8-bit with bitsandbytes and saved with accelerate's save_model. Unless I lower the VRAM limit by about 10 GiB and the RAM limit by around 30 GiB, I keep getting out-of-memory errors.
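For context, this is roughly how those shards were produced (a sketch from memory; the output path and generation-time details are illustrative, not exactly what I ran):

# Rough sketch of how the 8-bit shards were created earlier
from accelerate import Accelerator

with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

fp16_location = snapshot_download(repo_id="deepseek-ai/deepseek-coder-33b-instruct")
quantized = load_and_quantize_model(
    empty_model,
    weights_location=fp16_location,
    bnb_quantization_config=BnbQuantizationConfig(load_in_8bit=True),
    device_map="auto",
)
# Save the quantized weights as sharded safetensors files
Accelerator().save_model(quantized, "deepseek-coder-33b-instruct-8bit", safe_serialization=True)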
To my understanding, I should have no memory problems as long as the largest shard, which is around 9 GB, fits. So why do I need to cut the memory limits by more than half just to load the model?
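This is how I checked the shard sizes, in case it matters (just a quick sketch using the weights_location from the snippet above):

import os, glob

# Print each downloaded safetensors shard and its size in GiB
for path in sorted(glob.glob(os.path.join(weights_location, "*.safetensors"))):
    size_gib = os.path.getsize(path) / (1024**3)
    print(f"{os.path.basename(path)}: {size_gib:.2f} GiB")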
Even after lowering max_memory and getting the model loaded, I still get "CUDA out of memory" when trying to run inference. Is there something wrong with my setup, or with my understanding of how this works?
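For completeness, the inference call that triggers the error is essentially the following (the prompt and generation settings are just examples):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct", trust_remote_code=True)
inputs = tokenizer("write a quick sort algorithm in python", return_tensors="pt").to(0)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))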