"Out of memory" when loading quantized model

I managed to load the unquantized version of this model with load_checkpoint_and_dispatch. Even inference worked, though it took 2 hours. So even a model as large as 33B can run on my setup.
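
For reference, the unquantized load that worked looks roughly like this (the model path and the no_split_module_classes entry are placeholders for my actual setup):

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# Build an empty (meta-device) skeleton, then let accelerate place the
# checkpoint shards across GPU/CPU/disk according to device_map="auto".
config = AutoConfig.from_pretrained("path/to/model")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/model",                     # directory with the weight shards
    device_map="auto",                              # spill to CPU/disk when the GPU is full
    offload_folder="offload",                       # where offloaded weights are stored
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each decoder block on one device
)
model.eval()
```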

So I believe there are differences between how load_checkpoint_and_dispatch and load_and_quantize_model load the model. But I can't load the quantized model with load_checkpoint_and_dispatch. When I try, I get:

“Only Tensors of floating point and complex dtype can require gradients”

Adding dtype=torch.float sometimes helps (it only works intermittently: sometimes the model loads, sometimes it doesn't). But when it does load and I run inference, I get:

“probability tensor contains either inf, nan or element < 0”
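
Concretely, the quantized attempt looks roughly like the sketch below: the same call as above pointed at the quantized checkpoint, with the dtype=torch.float workaround, followed by the generate call that then fails. Paths, the no-split class, and the generation settings are placeholders for my actual setup.

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Same skeleton as before, but pointed at the quantized checkpoint.
config = AutoConfig.from_pretrained("path/to/quantized-model")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/quantized-model",
    device_map="auto",
    offload_folder="offload",
    no_split_module_classes=["LlamaDecoderLayer"],
    dtype=torch.float,  # without this I hit the "can require gradients" error above
)
model.eval()

# Sampling-based inference is where the "probability tensor" error surfaces.
tokenizer = AutoTokenizer.from_pretrained("path/to/quantized-model")
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```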

How should one correctly load the quantized model?