I managed to load the unquantized version of this model with `load_checkpoint_and_dispatch`. Even inference worked, though it took 2 hours, which means that even a model as large as 33B can run on my setup.
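For context, here is roughly how I loaded it (a minimal sketch; the checkpoint path is a placeholder for my actual setup):

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint_dir = "path/to/model"  # placeholder for my local checkpoint

# Build the model skeleton without allocating memory for the weights
config = AutoConfig.from_pretrained(checkpoint_dir)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Load the real weights and spread them across GPU/CPU/disk
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_dir,
    device_map="auto",
    offload_folder="offload",
)
```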
So I believe there are differences between how `load_checkpoint_and_dispatch` and `load_and_quantize_model` load the model. But I can’t load the quantized model with `load_checkpoint_and_dispatch`. When I try to do this I get:
“Only Tensors of floating point and complex dtype can require gradients”
Adding `dtype=torch.float` sometimes helps (yes, sometimes it works, sometimes not). But when it does load and I run inference, I get:
“probability tensor contains either `inf`, `nan` or element < 0”
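To be concrete, the “sometimes works” variant is just the same call as above with the dtype forced:

```python
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_dir,
    device_map="auto",
    offload_folder="offload",
    dtype=torch.float,  # cast loaded weights to fp32 to get past the gradients error
)
```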
How should one correctly load the quantized model?
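For reference, this is roughly the `load_and_quantize_model` pattern from the accelerate docs that I have been comparing against (a sketch; the 8-bit settings are just what I would expect to use, not something confirmed to work here):

```python
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint_dir = "path/to/model"  # placeholder for my local checkpoint

# Empty model skeleton, same as before
config = AutoConfig.from_pretrained(checkpoint_dir)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# bitsandbytes 8-bit quantization config (assumed settings)
bnb_config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6.0)

model = load_and_quantize_model(
    empty_model,
    weights_location=checkpoint_dir,
    bnb_quantization_config=bnb_config,
    device_map="auto",
)
```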