I was reading the tutorial on how to load large models for inference with Accelerate (Handling big models for inference) and saw the note that device_map='auto' is not suitable for training because parts of the code are wrapped in torch.no_grad contexts. My question is: which specific parts actually require no_grad, and in which settings do they apply?
I went and tried device_map='auto' anyway to load a pretrained model across 2 GPUs, and it seemed like it was still training: the training loss decreased and the weights were updated. However, given the warning in the docs, I'm not sure whether the model was actually trained across all layers/batches/etc. Would it be possible to clarify the warning in the docs with a bit more specificity?
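For reference, this is roughly what I ran (the model name, data, and hyperparameters below are just placeholders, not my actual setup):

```python
# Rough sketch of what I tried: load with device_map='auto' and run a training step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")  # layers spread over the 2 GPUs
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("some training text", return_tensors="pt").to("cuda:0")  # inputs go to the first device
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # loss went down and weights changed across steps, hence my question
optimizer.step()
optimizer.zero_grad()
```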
Considering it's a tutorial for inference, there is no mention of training because training is not supported. It is solely for what its title says: big model inference.
We're looking into a method for supporting training; however, generally you'd want to use DeepSpeed or FSDP for this.
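As a sketch of what that looks like in practice (the model and dummy data here are placeholders): you write a plain Accelerate training loop, and when you launch it with an FSDP or DeepSpeed setup (e.g. `accelerate config` followed by `accelerate launch train.py`), `prepare()` wraps the model accordingly without changes to the training code itself.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy data just to keep the sketch self-contained
input_ids = torch.randint(0, model.config.vocab_size, (8, 32))
dataloader = DataLoader(TensorDataset(input_ids), batch_size=2)

# With an FSDP/DeepSpeed launch config, prepare() handles the sharding/wrapping
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for (batch,) in dataloader:
    outputs = model(input_ids=batch, labels=batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```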
Hi @muellerzr, thanks for the reply. I guess I had seen the note on training at the end of the "Run the model" section of the tutorial; it's just a short blurb warning against using the feature for training. But the point is well taken that the intent of the feature is inference, and that ZeRO or FSDP are the officially supported methods right now.
My follow-up question is: why wouldn't model parallelism across multiple GPUs work for training? I'm looking through the GPT2Model implementation specifically and saw there's a pair of functions being deprecated for parallelizing/deparallelizing the model. While it's probably not the same implementation as in from_pretrained(device_map='auto'), the idea seems to be the same: put different layers onto different devices. Also, in the forward function there's code for moving the intermediate tensors to the device of the distributed layers. This looks like the basic model parallelism paradigm in native torch, as suggested in this torch tutorial. It seems like backprop in torch should be able to handle this exact scenario of a model distributed across multiple GPUs. Am I missing something?
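Here's a minimal sketch of the naive model-parallel pattern I mean (toy layers rather than GPT2 blocks, assuming 2 GPUs), which autograd handles fine:

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(128, 128).to("cuda:0")  # first half on GPU 0
        self.block2 = nn.Linear(128, 10).to("cuda:1")   # second half on GPU 1

    def forward(self, x):
        x = torch.relu(self.block1(x.to("cuda:0")))
        # move the intermediate activation to the device of the next block,
        # just like the device moves in GPT2Model.forward
        return self.block2(x.to("cuda:1"))

model = TwoGPUNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 128)
target = torch.randint(0, 10, (16,), device="cuda:1")

loss = nn.functional.cross_entropy(model(x), target)
loss.backward()   # autograd follows the cross-device graph without special handling
optimizer.step()
```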
edit: I did some more reading through various docs/source files; is GPT2 specifically an exception that does support training with model parallelism? I saw a line in the multi-GPU training doc that says:
Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive MP support.
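If I'm reading the GPT2 source right, the "naive MP" it refers to is the deprecated parallelize()/deparallelize() API, roughly like this (the device_map maps GPU index to transformer block indices; the split below is just an example for the 12-layer gpt2 checkpoint):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.parallelize({0: list(range(0, 6)), 1: list(range(6, 12))})  # blocks 0-5 on GPU 0, 6-11 on GPU 1

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
batch = tokenizer("hello world", return_tensors="pt").to("cuda:0")  # inputs go to the first device

outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # gradients flow back across both GPUs

model.deparallelize()  # moves everything back when done
```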