DeepSpeed ZeRO Inference

@stas,
I saw your tweet and had two questions.

  1. Does using DeepSpeed inference require consolidating the files in the checkpoint dir, or does it do that for you and then reshard them across the available GPUs?

  2. Does this now support models larger than GPU RAM, or is this comment in the docs still accurate:

Additionally DeepSpeed is currently developing a related product called Deepspeed-Inference which has no relationship to the ZeRO technology, but instead uses tensor parallelism to scale models that can’t fit onto a single GPU. This is a work in progress and we will provide the integration once that product is complete.

In general I am trying to figure out if there exists a sharded checkpoint format that allows either (a) inference or (b) resuming training on a different world size.

DeepSpeed ZeRO Inference is the same as ZeRO Training, except that it doesn’t allocate an optimizer or an lr scheduler, and it requires ZeRO-3.

Therefore it always supports models larger than a single GPU’s RAM.
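To make the difference concrete, here is a minimal sketch of deriving an inference config from a training config. The `optimizer`, `scheduler` and `zero_optimization.stage` keys are standard DeepSpeed config sections; the helper `to_inference_config` and the example values are hypothetical, just for illustration.

```python
import copy


def to_inference_config(train_config: dict) -> dict:
    """Derive a ZeRO Inference config from a ZeRO training config (sketch)."""
    config = copy.deepcopy(train_config)  # leave the caller's config untouched
    # Inference allocates no optimizer and no lr scheduler, so drop both sections.
    config.pop("optimizer", None)
    config.pop("scheduler", None)
    # ZeRO Inference requires stage 3, which partitions the parameters themselves.
    config.setdefault("zero_optimization", {})["stage"] = 3
    return config


# Hypothetical training config, values chosen only for the example.
train_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": 100}},
}
inference_config = to_inference_config(train_config)
```

Only stage 3 shards the weights, which is why the lower stages don't help when the model itself doesn't fit on one GPU.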

During Training it indeed saves a sharded state checkpoint.

During Inference it doesn’t need to do that. While it can load from a sharded trained-with-ZeRO checkpoint, it is meant to be used with a normal single-file checkpoint. It uses zero.Init to split the weights across the GPUs while loading the model.
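Conceptually, the split at load time looks something like the sketch below: each rank keeps only its own slice of a flattened parameter. This is a plain-Python illustration of the idea, not DeepSpeed's actual implementation; `local_shard` and the zero-padding of the last shard are my assumptions for the example.

```python
from typing import List


def local_shard(weights: List[float], rank: int, world_size: int) -> List[float]:
    """Return the slice of a flattened parameter that a given rank would keep."""
    shard_len = -(-len(weights) // world_size)  # ceiling division
    # Pad with zeros so every rank holds a shard of the same length.
    padded = weights + [0.0] * (shard_len * world_size - len(weights))
    return padded[rank * shard_len:(rank + 1) * shard_len]


flat_param = [1.0, 2.0, 3.0, 4.0, 5.0]
shard_on_rank0 = local_shard(flat_param, rank=0, world_size=2)
shard_on_rank1 = local_shard(flat_param, rank=1, world_size=2)
```

With 2 ranks, each one ends up holding 3 values instead of all 5, so per-GPU memory for the weights shrinks roughly in proportion to the number of GPUs.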

When we do normal training, we need to do validation, so inference happens there already.

So yes, DeepSpeed uses a sharded checkpoint format.

I don’t think you can currently take a checkpoint sharded for N GPUs and load it on M GPUs automatically.

I wrote a script to consolidate fp32 weights: Zero Redundancy Optimizer (ZeRO) - DeepSpeed
If you use it, you can move from N to M GPUs, since it generates a single unsharded checkpoint.

You can, of course, write code that will rewrite the checkpoints to change from N to M gpus.
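A rough sketch of what such a rewrite involves, assuming each rank's shard is a contiguous, zero-padded slice of a flattened parameter (the real ZeRO checkpoint layout is more involved, and `consolidate` and `reshard` are hypothetical names):

```python
from typing import List


def consolidate(shards: List[List[float]], numel: int) -> List[float]:
    """Concatenate per-rank shards and drop the padding at the tail."""
    full = [value for shard in shards for value in shard]
    return full[:numel]


def reshard(shards: List[List[float]], numel: int, new_world_size: int) -> List[List[float]]:
    """Rewrite shards saved for N ranks into shards for M ranks."""
    full = consolidate(shards, numel)
    shard_len = -(-numel // new_world_size)  # ceiling division
    padded = full + [0.0] * (shard_len * new_world_size - numel)
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(new_world_size)]


# 5 real values saved on N=2 ranks (the last slot is padding), rewritten for M=3.
old_shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0]]
new_shards = reshard(old_shards, numel=5, new_world_size=3)
```

Note that this version materializes the full tensor in memory before splitting it again, which is exactly the cost discussed next.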

The main reason I wrote it as a separate tool is that it requires a huge amount of CPU RAM, so doing this dynamically at run time could be too demanding and may require an instance with a lot of CPU memory.
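For a feel of the numbers, here is a back-of-the-envelope estimate. It only counts the consolidated fp32 weights themselves at 4 bytes per parameter, so treat it as a lower bound; the actual peak usage of the script can be higher. The 10B parameter count is just an example.

```python
def fp32_weights_gib(num_params: int) -> float:
    """Size in GiB of the weights alone when every parameter is stored in fp32."""
    bytes_per_param = 4  # fp32
    return num_params * bytes_per_param / 2**30


gib_for_10b = fp32_weights_gib(10_000_000_000)  # example: a 10B-parameter model
```

That works out to about 37 GiB for a 10B-parameter model before counting anything else the process holds in memory.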


In the case of TP+PP+DP, which we use in Megatron-DeepSpeed, Tunji has been working on converting any DeepSpeed PP-style checkpoint from any TP/PP degree to any other TP/PP degree. But the PP checkpoint is slightly different from the ZeRO checkpoint. It has each layer in a separate checkpoint, and saves each TP split in a separate checkpoint. So TP is hardwired and can’t be changed once training has started, whereas PP can be changed on the fly. But of course one can reformat the checkpoint to change the TP degree as well. The tools are here:

They also include conversion from Meg-DS → Megatron-LM → HF Transformers.
More tools are planned to be added there.
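To illustrate why PP can change on the fly while TP can't, here is a small sketch of the layout described above. The file names are hypothetical (not the actual Megatron-DeepSpeed naming), and the even, contiguous layer split is an assumption for the example.

```python
from typing import List


def checkpoint_files(num_layers: int, tp_degree: int) -> List[str]:
    """One file per (layer, TP split). Hypothetical naming, for illustration only."""
    return [
        f"layer_{layer:02d}-tp_{tp_rank:02d}.pt"
        for layer in range(num_layers)
        for tp_rank in range(tp_degree)
    ]


def layers_per_stage(num_layers: int, pp_degree: int) -> List[List[int]]:
    """Assign contiguous layers to pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, pp_degree)
    stages, start = [], 0
    for stage in range(pp_degree):
        size = base + (1 if stage < extra else 0)  # first `extra` stages get one more
        stages.append(list(range(start, start + size)))
        start += size
    return stages


files = checkpoint_files(num_layers=4, tp_degree=2)
two_stages = layers_per_stage(num_layers=4, pp_degree=2)
four_stages = layers_per_stage(num_layers=4, pp_degree=4)
```

The set of files depends only on the number of layers and the TP degree. Changing the PP degree just changes which stage loads which per-layer files, whereas changing the TP degree means the contents of every file have to be re-split.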


DeepSpeed Inference (not ZeRO) uses Tensor Parallelism, and at the moment it is still a WIP. I think right now it works with a single checkpoint, but it’d probably make sense to pre-shard it for faster loading.


Please let me know if I have addressed all the questions.
