I was wondering about two things:
- How do Spaces run inference on T0pp so fast? For me, CPU inference takes dozens of seconds, so presumably it's on a GPU? Or some hacky/magic optimizations that Infinity has?
- How do you load such a big model (a ~42 GB checkpoint) into the 40 GB of RAM on an A100, just for inference? In fp32, even the weights alone won't fit into memory. Mixed precision could help a lot here, but I don't see a way to move a model to the GPU in mixed precision (rough numbers and the plain-fp16 fallback are sketched right after this list).
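For reference, the back-of-the-envelope math: T0pp has ~11B parameters, so fp32 weights take 4 bytes × 11B ≈ 44 GB, while fp16 halves that to ~22 GB. A plain fp16 load (which is not mixed precision, hence my question) would presumably look something like this, if I'm reading the transformers API right:

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Cast weights to fp16 while loading: ~22 GB instead of ~44 GB in fp32.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "bigscience/T0pp", torch_dtype=torch.float16
)
model = model.eval().cuda()  # fp16 weights now fit in the A100's 40 GB
```

But running everything in pure fp16 can be numerically fragile (layer norms in particular), which is why I'd prefer proper mixed precision.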
I had an idea of individually transferring each parameter first to .cuda() and then to .half(), but then you need to manually specify which ones should stay in fp32 and basically write mixed precision from scratch. Sounds fun, but it would definitely take some time.
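Roughly what I had in mind (the keep-in-fp32 name list is just a guess for T5-style layer norms, and the forward pass would still need manual casts wherever fp16 activations meet fp32 weights, which is the "from scratch" part):

```python
import torch

def cuda_half_params(model, keep_fp32=("layer_norm", "layernorm")):
    """Move parameters to the GPU one by one, casting most to fp16."""
    for name, param in model.named_parameters():
        if any(key in name.lower() for key in keep_fp32):
            param.data = param.data.cuda()         # stays in fp32
        else:
            param.data = param.data.cuda().half()  # fp16 weight
    for _, buf in model.named_buffers():
        buf.data = buf.data.cuda()                 # buffers to GPU as-is
    return model
```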
Is there any other method or existing tool that allows doing that? Maybe DeepSpeed's Inference mode supports something similar?
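From skimming the DeepSpeed docs, I'd expect usage roughly like the snippet below (dtype=torch.half plus kernel injection), but I haven't tried it and don't know whether T5-style models are covered:

```python
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # single GPU, no model parallelism
    dtype=torch.half,                 # run inference in fp16
    replace_with_kernel_inject=True,  # swap in fused inference kernels
)
model = engine.module
```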