Can I load a model into memory using fp16 or quantization, while running it with dynamically cast fp32 (since the CPU doesn't support fp16)?
I tried things like `load_in_4bit=True`, `load_in_8bit=True`, and `torch_dtype=torch.float16`, but none of those work.
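For reference, this is roughly the kind of thing I was trying (the model name is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# One of the quantized-load attempts (kept here for reference):
# model = AutoModelForCausalLM.from_pretrained("some/model", load_in_8bit=True)

# The fp16 attempt: load in half precision to keep memory usage down...
model = AutoModelForCausalLM.from_pretrained(
    "some/model",              # placeholder model name
    torch_dtype=torch.float16,
)

# ...then cast the weights to fp32 so the CPU can actually run them
model = model.float()
```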
I also tried offloading to disk, but that hangs my whole machine and I have to force a reboot.
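The disk-offload attempt looked roughly like this (the offload directory is just an example path):

```python
from transformers import AutoModelForCausalLM

# Let accelerate place the weights and spill what doesn't fit in RAM to disk;
# this is the setup that ended up hanging the machine for me
model = AutoModelForCausalLM.from_pretrained(
    "some/model",            # placeholder model name
    device_map="auto",
    offload_folder="offload",
)
```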
Any ideas?