Hello
When training a model on one GPU with DeepSpeed, I was able to save the model, convert the weights to a pytorch_model.bin with the provided zero_to_fp32.py script, and train again
(see here: [Solved] Cannot restart training from deepspeed checkpoint).
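For reference, this is roughly the conversion step, done programmatically with the helper that zero_to_fp32.py exposes ("checkpoint_dir" is a placeholder for my actual save directory):

```python
# Programmatic equivalent of running the zero_to_fp32.py script;
# "checkpoint_dir" is a placeholder for the DeepSpeed checkpoint directory.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Consolidate the ZeRO-partitioned weights into a single fp32 state dict.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint_dir")
torch.save(state_dict, "pytorch_model.bin")
```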
However, I'm not able to load this model again and train it on two GPUs this time:
```
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 1 but the current world size is 2. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
```
The error seems clear, but is there any way to load the model (on CPU or GPU) as one would for inference, and then split it across the two GPUs?
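Concretely, something like the following sketch is what I have in mind, assuming the converted pytorch_model.bin from above (the model itself and the ds_config.json path are placeholders for my actual ones):

```python
# Sketch: load the converted fp32 weights as a plain PyTorch checkpoint,
# then build a fresh DeepSpeed engine for the new world size instead of
# restoring the old single-GPU ZeRO optimizer state.
# Launched with: deepspeed --num_gpus 2 train.py
import torch
import deepspeed

# Placeholder model standing in for my actual architecture.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))

# Fresh engine on 2 GPUs; ds_config.json is a placeholder DeepSpeed config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)
```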
Thanks in advance,
Have a great day