I was wondering if this parameter is only set for inference as the document states (Handling big models for inference) or does it actually have an effect during training?


It loads the model across multiple GPUs. Once loaded, the model can be run forward or backward. I have only used "auto" for training so far, and it works.

If you refer to this section:

This only supports the inference of your model, not training. Most of the computation happens behind torch.no_grad() context managers to avoid spending some GPU memory with intermediate activations.

I think it only applies to the offloading to CPU or disk mechanism, but not when the full model can be loaded onto several GPUs.

Thanks @hansekbrand that is helpful.

After more searching, it became clear to me that device_map="auto" is doing naive MP for training (ref: Make all Transformer models compatible with model parallelism · Issue #22561 · huggingface/transformers · GitHub).
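To make the naive MP picture concrete, here is a minimal, hypothetical sketch of the kind of greedy placement device_map="auto" performs: modules are assigned to devices in order, spilling to the next device when the current one's memory budget is exhausted. The function name, sizes, and budgets below are illustrative assumptions, not Accelerate's actual implementation.

```python
# Hypothetical, simplified sketch of "auto" device placement:
# greedily assign modules to devices in order, moving to the next
# device once the current one cannot fit the next module.
# Names and numbers are illustrative, not Accelerate's real API.

def infer_device_map(module_sizes, device_budgets):
    """Greedily place modules (name -> GB) onto devices (name -> capacity in GB)."""
    device_map = {}
    devices = list(device_budgets.items())
    idx = 0
    free = devices[0][1]
    for name, size in module_sizes.items():
        # Spill to the next device when this module no longer fits.
        while size > free and idx + 1 < len(devices):
            idx += 1
            free = devices[idx][1]
        device_map[name] = devices[idx][0]
        free -= size
    return device_map

# Four transformer blocks of 2 GB each, split over two 5 GB GPUs:
sizes = {"block.0": 2, "block.1": 2, "block.2": 2, "block.3": 2}
budgets = {"cuda:0": 5, "cuda:1": 5}
print(infer_device_map(sizes, budgets))
# → {'block.0': 'cuda:0', 'block.1': 'cuda:0', 'block.2': 'cuda:1', 'block.3': 'cuda:1'}
```

During forward and backward, activations then flow sequentially from one device to the next, which is why this is "naive" MP: only one GPU computes at a time.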

Correct, any form of distributed training aside from naive MP is not supported, and as of the next version a proper error will be raised if you try to do so:

Thanks @muellerzr, could you also take a look at a related problem I have: ZeRO uses more RAM than DDP?