How do I save memory when running inference on CPU?

Can I load a model into memory in fp16 or quantized form, and then run it with the weights dynamically cast to fp32 at compute time (since the CPU doesn’t support fp16)?

I tried things like load_in_4bit=True, load_in_8bit=True, and torch_dtype=torch.float16, but none of those work.
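
For reference, this is roughly what those attempts looked like (the model name is just a placeholder; as far as I can tell, load_in_8bit / load_in_4bit go through bitsandbytes, which expects a CUDA GPU, and plain float16 weights run into missing half-precision kernels on CPU):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "gpt2"  # placeholder; substitute your own model

# Attempt 1: fp16 weights -- they load, but many CPU ops have no float16 kernel,
# so the forward pass fails or falls back to errors like "not implemented for 'Half'"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Attempt 2: 8-bit / 4-bit quantization -- relies on bitsandbytes,
# which (at least in the versions I tried) needs a CUDA GPU
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
```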

I also tried offloading to disk, but that hangs my whole machine and I have to force a reboot.
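
By "offloading to disk" I mean something like the sketch below, using the accelerate-style device_map / offload_folder arguments (the folder path is just a placeholder); presumably the constant disk traffic is what froze the machine:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "gpt2"  # placeholder; substitute your own model

# Attempt 3: let accelerate spill weights that don't fit in RAM out to disk.
# This is the setup that hung my machine and forced a reboot.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    offload_folder="offload",  # placeholder path for the offloaded weights
)
```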

Any ideas?

Well, I found one way: use torch_dtype=torch.bfloat16. bfloat16 seems to be supported on CPU, unlike normal half precision (float16).
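
A minimal sketch of what worked for me (the model name is a placeholder; the key part is torch_dtype=torch.bfloat16, which roughly halves the weight memory versus fp32 while still having CPU kernels for most ops):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute your own model

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights directly in bfloat16: about half the RAM of fp32,
# and unlike float16 it is supported by PyTorch's CPU kernels.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

From what I can tell, on older CPUs the bfloat16 ops may internally upconvert to fp32 for the actual math, so it isn't necessarily faster, but the memory saving from storing the weights in bfloat16 still applies.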