device_map="auto"

It loads a model across multiple GPUs. Once loaded, the model can be run forward or backward. So far I have only used "auto" for training, and it works.
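
For reference, a minimal sketch of the setup I mean (the checkpoint name is just a placeholder; any causal LM should behave the same, and device_map="auto" requires `accelerate` to be installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "facebook/opt-1.3b" is only a placeholder checkpoint.
# With device_map="auto", Accelerate decides which GPU (or CPU/disk,
# if VRAM runs out) each layer is placed on.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# Inputs only need to be on the device of the first shard (GPU 0 here).
inputs = tokenizer("Hello world", return_tensors="pt").to(0)
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # works for me when the model fits entirely on GPUs
```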

As for the section you are referring to:

> This only supports the inference of your model, not training. Most of the computation happens behind torch.no_grad() context managers to avoid spending some GPU memory with intermediate activations.

I think that restriction only applies to the CPU/disk offloading mechanism, not to the case where the full model fits across several GPUs.
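
One way to check which case you are in (a sketch, assuming the model object from above): models loaded with a device_map expose an hf_device_map attribute, and offloading is only in play if it contains "cpu" or "disk" entries.

```python
# Inspect where each submodule was placed; values are GPU indices,
# "cpu", or "disk".
print(model.hf_device_map)

# If nothing was offloaded, every parameter sits on a GPU and, in my
# experience, backward works as usual.
offloaded = any(d in ("cpu", "disk") for d in model.hf_device_map.values())
print("offloaded:", offloaded)
```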