Hi, I’m using the Accelerate framework to offload the weight parameters to CPU DRAM for DNN inference.
To achieve this, I’m referring to Accelerate’s device_map, described here: Handling big models for inference.
However, I recently came across another document discussing DeepSpeed’s ZeRO-3 offload, which seems to offer similar functionality.
I’m wondering if these two approaches are the same or if there are any differences between them.
Specifically, am I effectively using DeepSpeed just by passing device_map when calling from_pretrained?
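For context, here is roughly the call I mean (a minimal sketch; the checkpoint name is just a placeholder):

```python
# Sketch of weight offloading via Accelerate's device_map (placeholder checkpoint).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",        # placeholder model id
    torch_dtype=torch.float16,
    device_map="auto",          # Accelerate places weights on GPU / CPU DRAM (/ disk)
)
```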
Hello, no, the two are different.
device_map does naive pipelining (it places different layers on different GPUs, in CPU RAM, or on disk), while DeepSpeed ZeRO-3 shards parameters, optimizer states, and gradients across GPUs and can then offload those partitions to the CPU. DeepSpeed ZeRO-3 is generally used for training; Accelerate’s device_map is generally used for big model inference.
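For comparison, here is a rough sketch of what ZeRO-3 parameter offload looks like (a sketch, not a full recipe: the config keys follow the DeepSpeed docs, the checkpoint name is a placeholder, and the script is meant to be launched with the deepspeed launcher, e.g. `deepspeed script.py`):

```python
# Sketch: ZeRO-3 with parameter offload to CPU DRAM (run via `deepspeed script.py`).
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                          # shard params/grads/optimizer states
        "offload_param": {"device": "cpu"},  # keep the parameter shards in CPU DRAM
    },
}

model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b")  # placeholder
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()  # each rank runs the full model, fetching shards as needed
```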
Oh yes, I know there’s far more to it than just offloading parameters from GPU to CPU when it comes to training.
But I’m only using it for inference.
As far as I know, during inference DeepSpeed ZeRO-3 doesn’t do any optimization beyond offloading parameters to CPU DRAM, so I thought the two were the same. Aren’t they?
During inference too, there is a clear difference between ZeRO-3 and device_map/naive pipelining: ZeRO-3 runs a different mini-batch on each GPU, leading to higher throughput, whereas device_map runs the same batch while hopping across the GPUs, leading to lower throughput.
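To make the throughput claim concrete, here is a toy back-of-the-envelope model (pure arithmetic with made-up timings, ignoring the cost of gathering parameter shards):

```python
# Toy throughput comparison on 4 GPUs, ignoring communication overhead.
n_gpus = 4
t_slice = 1.0  # hypothetical time for one GPU's quarter of the model, per batch

# device_map / naive pipelining: one batch visits all 4 slices in sequence,
# so only one GPU is busy at any moment.
t_per_batch_device_map = n_gpus * t_slice       # 4.0 time units per batch

# ZeRO-3 data parallelism: each GPU runs the whole model on its own batch,
# so 4 different batches finish together in roughly one full-model time.
t_per_4_batches_zero3 = n_gpus * t_slice        # ~4.0 time units for 4 batches

print("device_map:", 1 / t_per_batch_device_map, "batches per unit time")     # 0.25
print("ZeRO-3    :", n_gpus / t_per_4_batches_zero3, "batches per unit time") # 1.0
```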
Yes, I think I read that sentence in the device_map documentation, but when it comes to a single GPU and a fixed batch size, can’t we conclude that the two are the same?