How to load T0pp into 40GB of GPU memory using mixed precision?

I was wondering about two things:

  1. How does :hugs: Spaces run inference on T0pp so fast? For me, CPU inference takes dozens of seconds, so it's probably on GPU? Maybe some hacky/magic optimizations that Infinity has?
  2. How do you load such a big model (42GB checkpoint) into 40GB of A100 RAM (just for inference)?

If you use fp32, even the weights alone will not fit into memory. Mixed precision could help a lot here, but I don’t see a way to move the model to the GPU in mixed precision.

I had an idea of individually transferring each parameter first to .cuda() and then to .half(), but then you need to manually specify which ones should stay in fp32 and basically write mixed precision from scratch. Sounds fun, but it would definitely take some time.
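Roughly something like this sketch (just an illustration: the `AutoModelForSeq2SeqLM` loading and the "keep layer norms in fp32" rule are assumptions, not a worked-out recipe):

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Load the fp32 checkpoint into CPU RAM first (this still needs ~42GB of system RAM).
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")
model.eval()

# Move parameters one at a time, so at most one fp32 tensor lives on the GPU
# at any moment; cast everything except (for example) the layer norms to fp16.
for name, param in model.named_parameters():
    keep_fp32 = "layer_norm" in name  # assumption: which tensors stay in fp32
    param.data = param.data.cuda()
    if not keep_fp32:
        param.data = param.data.half()

# Buffers would need the same treatment, and the forward pass would still have
# to be wrapped in torch.autocast or manual casts -- that's the "from scratch" part.
```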

Is there any other method or existing tool that makes this possible? Maybe DeepSpeed Inference supports something similar?

Hey @dropout05, we now have sharded checkpoints and low-memory loading in transformers to help handle these memory issues: Instantiating a big model
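In practice, the low-memory path can look like this minimal sketch (fp16 for the whole model is an assumption rather than the only option, and `device_map="auto"` assumes `accelerate` is installed and a single-GPU setup):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "bigscience/T0pp",
    torch_dtype=torch.float16,   # load the weights directly in fp16 (~21GB)
    low_cpu_mem_usage=True,      # avoid materializing a second full copy in CPU RAM
    device_map="auto",           # place the shards on the available GPU(s)
)

# Quick inference check (prompt is just an example)
inputs = tokenizer(
    "Is this review positive or negative? Review: great movie!",
    return_tensors="pt",
).to("cuda")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```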

Thank you for the reply! I had this issue a while ago, and I appreciate this new feature a lot. I hope this answer will also be useful to somebody who randomly googles the problem :upside_down_face: