How do I save memory when running inference on CPU?

Can I load a model into memory in fp16 or quantized form, and then run it with the weights dynamically cast to fp32 at compute time (since the CPU doesn’t support fp16)?

I tried things like load_in_4bit=True, load_in_8bit=True, and torch_dtype=torch.float16, but none of those work.
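
For reference, this is roughly what those attempts looked like (the model name is just a placeholder; as far as I can tell, load_in_8bit / load_in_4bit go through bitsandbytes, which expects a CUDA GPU, and plain float16 weights run into missing half-precision kernels on CPU):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "gpt2"  # placeholder; substitute your own model

# Attempt 1: fp16 weights -- they load, but many CPU ops have no float16 kernel,
# so the forward pass fails or falls back to errors like "not implemented for 'Half'"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Attempt 2: 8-bit / 4-bit quantization -- relies on bitsandbytes,
# which (at least in the versions I tried) needs a CUDA GPU
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
```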

I also tried offloading to disk, but that hangs my whole machine and I have to force a reboot.
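
By "offloading to disk" I mean something like the sketch below, using the accelerate-style device_map / offload_folder arguments (the folder path is just a placeholder); presumably the constant disk traffic is what froze the machine:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "gpt2"  # placeholder; substitute your own model

# Attempt 3: let accelerate spill weights that don't fit in RAM out to disk.
# This is the setup that hung my machine and forced a reboot.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    offload_folder="offload",  # placeholder path for the offloaded weights
)
```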

Any ideas?

Well, I found one way: use torch_dtype=torch.bfloat16. bfloat16 seems to be supported on CPU, unlike normal half precision (float16).
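
A minimal sketch of what worked for me (the model name is a placeholder; the key part is torch_dtype=torch.bfloat16, which roughly halves the weight memory versus fp32 while still having CPU kernels for most ops):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute your own model

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights directly in bfloat16: about half the RAM of fp32,
# and unlike float16 it is supported by PyTorch's CPU kernels.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

From what I can tell, on older CPUs the bfloat16 ops may internally upconvert to fp32 for the actual math, so it isn't necessarily faster, but the memory saving from storing the weights in bfloat16 still applies.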