Using device_map='auto' for training

I was reading the Accelerate tutorial on loading large models for inference (Handling big models for inference) and saw the note that device_map='auto' is not suitable for training because parts of the code are wrapped in torch.no_grad contexts. My question is: which specific parts actually require no_grad, and in what settings do they apply?

I went and tried device_map='auto' anyway to load a pretrained model across 2 GPUs, and it seemed like it was still training: the training loss decreased and the weights were updated. However, given the warning in the docs, I'm not sure whether the model was actually trained across all layers/batches/etc. Would it be possible to make the warning in the docs a bit more specific?


Considering it's a tutorial for inference, there is no mention of training because training is not supported. The feature is solely what its title says: big model inference.

We're looking into a method for supporting training; however, generally you'd want to use DeepSpeed or FSDP for this.


Hi @muellerzr, thanks for the reply. I had seen the note on training at the end of the "Run the model" section of the tutorial; it's just a short blurb warning against using the feature for training. But the point is well taken that the feature is intended for inference, and that DeepSpeed ZeRO or FSDP are the officially supported methods right now.

My follow-up question is: why wouldn't model parallelism across multiple GPUs work for training? I'm looking through the GPT2Model implementation specifically and saw there's a pair of deprecated functions for parallelizing/deparallelizing the model. While that's probably not the same implementation as from_pretrained(device_map='auto'), the idea seems the same: put different layers onto different devices. Also, in the forward function there's code for moving the intermediate tensors to the devices of the distributed layers. This looks like the basic model parallelism paradigm in native torch, as suggested in this torch tutorial. It seems like backprop in torch should be able to handle this exact scenario of a model distributed across multiple GPUs. Am I missing something?

edit: I did some more reading through various docs/source files. Is GPT2 specifically an exception that supports training with model parallelism? I saw a line in the multi-GPU training doc that says:

🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive MP support.

Hi @songs1 @muellerzr
Can I train my models using naive model parallelism by following the steps below?

  1. To load the model onto multiple GPUs (2 in my case), I pass device_map="auto" to the from_pretrained method.
  2. To train the model, I use the Trainer API, since the Trainer documentation says it supports multi-GPU training.

Would the above steps result in successful training?