"Out of memory" when loading quantized model

I managed to load the unquantized version of this model with load_checkpoint_and_dispatch. Even inference worked, though it took 2 hours. So even a model as large as 33B can run on my setup.
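
For reference, the unquantized load that worked looks roughly like this (the model path and the no_split_module_classes entry are placeholders for my actual setup):

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# Build an empty (meta-device) skeleton, then let accelerate place the
# checkpoint shards across GPU/CPU/disk according to device_map="auto".
config = AutoConfig.from_pretrained("path/to/model")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/model",                     # directory with the weight shards
    device_map="auto",                              # spill to CPU/disk when the GPU is full
    offload_folder="offload",                       # where offloaded weights are stored
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each decoder block on one device
)
model.eval()
```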

So I believe there are differences between how load_checkpoint_and_dispatch and load_and_quantize_model load the model. But I can't load the quantized model with load_checkpoint_and_dispatch. When I try, I get:

“Only Tensors of floating point and complex dtype can require gradients”

Adding dtype=torch.float sometimes helps (it only works intermittently: sometimes the model loads, sometimes it doesn't). But when it does load and I run inference, I get:

“probability tensor contains either inf, nan or element < 0”
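
Concretely, the quantized attempt looks roughly like the sketch below: the same call as above pointed at the quantized checkpoint, with the dtype=torch.float workaround, followed by the generate call that then fails. Paths, the no-split class, and the generation settings are placeholders for my actual setup.

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Same skeleton as before, but pointed at the quantized checkpoint.
config = AutoConfig.from_pretrained("path/to/quantized-model")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/quantized-model",
    device_map="auto",
    offload_folder="offload",
    no_split_module_classes=["LlamaDecoderLayer"],
    dtype=torch.float,  # without this I hit the "can require gradients" error above
)
model.eval()

# Sampling-based inference is where the "probability tensor" error surfaces.
tokenizer = AutoTokenizer.from_pretrained("path/to/quantized-model")
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```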

How should one correctly load the quantized model?