I was wondering about two things:
- How do Spaces run inference on T0pp so fast? For me, CPU inference takes dozens of seconds, so presumably it's on a GPU? Or some hacky/magic optimizations that Infinity has?
- How do you load such a big model (a ~42 GB checkpoint) into the 40 GB of RAM on an A100, just for inference? In fp32, even the weights alone won't fit into memory. Mixed precision could help a lot here, but I don't see a way to move a model to the GPU in mixed precision (rough numbers and the plain-fp16 fallback are sketched right after this list).
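For reference, the back-of-the-envelope math: T0pp has ~11B parameters, so fp32 weights take 4 bytes × 11B ≈ 44 GB, while fp16 halves that to ~22 GB. A plain fp16 load (which is not mixed precision, hence my question) would presumably look something like this, if I'm reading the transformers API right:

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Cast weights to fp16 while loading: ~22 GB instead of ~44 GB in fp32.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "bigscience/T0pp", torch_dtype=torch.float16
)
model = model.eval().cuda()  # fp16 weights now fit in the A100's 40 GB
```

But running everything in pure fp16 can be numerically fragile (layer norms in particular), which is why I'd prefer proper mixed precision.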
I had an idea of individually transferring each parameter first to .cuda() and then to .half(), but then you need to manually specify which ones should stay in fp32 and basically write mixed precision from scratch. Sounds fun, but it would definitely take some time.
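Roughly what I had in mind (the keep-in-fp32 name list is just a guess for T5-style layer norms, and the forward pass would still need manual casts wherever fp16 activations meet fp32 weights, which is the "from scratch" part):

```python
import torch

def cuda_half_params(model, keep_fp32=("layer_norm", "layernorm")):
    """Move parameters to the GPU one by one, casting most to fp16."""
    for name, param in model.named_parameters():
        if any(key in name.lower() for key in keep_fp32):
            param.data = param.data.cuda()         # stays in fp32
        else:
            param.data = param.data.cuda().half()  # fp16 weight
    for _, buf in model.named_buffers():
        buf.data = buf.data.cuda()                 # buffers to GPU as-is
    return model
```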
Is there any other method or existing tool that allows doing that? Maybe DeepSpeed's Inference mode supports something similar?
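From skimming the DeepSpeed docs, I'd expect usage roughly like the snippet below (dtype=torch.half plus kernel injection), but I haven't tried it and don't know whether T5-style models are covered:

```python
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # single GPU, no model parallelism
    dtype=torch.half,                 # run inference in fp16
    replace_with_kernel_inject=True,  # swap in fused inference kernels
)
model = engine.module
```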