Can I load a model into memory in fp16 or a quantized format, but run it with the weights dynamically cast to fp32 (since my CPU doesn't support fp16)?
I tried things like load_in_4bit=True, load_in_8bit=True, and torch_dtype=torch.float16, but none of those work.
I also tried offloading to disk, but that hangs my whole machine and I have to force a reboot.
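Roughly what my loading code looks like (the model name is just a placeholder for the actual checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM

# What I tried, more or less; "my-model" stands in for the real checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "my-model",
    torch_dtype=torch.float16,  # also tried load_in_8bit=True / load_in_4bit=True instead
)
model.eval()
```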
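For the disk-offload attempt I used something like this (the offload folder name is just illustrative):

```python
# Disk offload via device_map; weights that don't fit in RAM get spilled to "offload/".
model = AutoModelForCausalLM.from_pretrained(
    "my-model",
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload",
)
```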
Any ideas?