I was wondering if this parameter is only set for inference as the document states (Handling big models for inference) or does it actually have an effect during training?
Thanks!
It loads a model onto multiple GPUs. Once loaded, the model can be run forward or backward. I have only used "auto" for training so far, and it works.
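For concreteness, here is a minimal sketch of what I mean, assuming at least two GPUs and a small causal LM (gpt2 is just an illustrative choice):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Spread the model across all visible GPUs (and CPU/disk only if it does not fit).
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Put the inputs on the device holding the first layers; accelerate's hooks move
# activations between devices during the forward pass.
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)

outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # backward runs too, as long as nothing was offloaded
```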
If you refer to this section:
> This only supports the inference of your model, not training. Most of the computation happens behind `torch.no_grad()` context managers to avoid spending some GPU memory with intermediate activations.
I think it only applies to the offloading to CPU or disk mechanism, but not when the full model can be loaded onto several GPUs.
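One way to tell which case you are in is to inspect the device map that was actually resolved; a small sketch, assuming the model was loaded with device_map="auto" as above:

```python
# After loading with device_map="auto", the resolved placement is stored on the
# model. If any entry is "cpu" or "disk", those weights are offloaded and the
# inference-only (torch.no_grad) limitation from the docs applies.
print(model.hf_device_map)

offloaded = {name: dev for name, dev in model.hf_device_map.items()
             if dev in ("cpu", "disk")}
if offloaded:
    print("Offloaded modules (training will not work):", offloaded)
else:
    print("Everything fits on GPU; forward and backward should both run.")
```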
Thanks @hansekbrand that is helpful.
With more searching, it became clear to me that device_map="auto" is doing naive MP for training (ref: Make all Transformer models compatible with model parallelism · Issue #22561 · huggingface/transformers · GitHub).
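For what it's worth, a rough sketch of what that naive MP training looks like in practice (the dataloader here is a hypothetical placeholder yielding tokenized batches; the model is the one loaded with device_map="auto" above):

```python
from torch.optim import AdamW

# The parameters live on several GPUs; the optimizer is fine with that because
# each parameter's state is created on that parameter's own device.
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for batch in dataloader:  # hypothetical dataloader of tokenized batches
    # Send the batch to the device of the first layers; the model moves labels
    # to the logits device internally when computing the loss.
    batch = {k: v.to(model.device) for k, v in batch.items()}
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note that with this scheme only one GPU is busy at a time, so it buys you memory, not speed.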
Correct, any form of distributed training aside from naive MP is not supported, and as of the next version a proper error will be raised if you try to do so.
Thanks @muellerzr, could you also take a look at a related problem I have: ZeRO uses more RAM than DDP?
Training mode does not support device_map="auto"; it will throw an error suggesting you load the model on a single device.