I was reading the tutorial on how to load large models for inference with Accelerate (Handling big models for inference) and saw the note that device_map='auto' is not suitable for training because parts of the code are wrapped in torch.no_grad contexts. My question is: which specific parts actually require no_grad, and in which settings do they apply?
I went and tried device_map='auto' anyway to load a pretrained model across 2 GPUs, and it seemed like it was still training: the training loss decreased and the weights were updated. However, given the warning in the docs, I'm not sure whether the model was actually trained across all layers/batches/etc. Would it be possible to clarify the warning in the docs with a bit more specificity?
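For reference, this is roughly what I ran (the model name, data, and hyperparameters below are just placeholders, not my actual setup):

```python
# Rough sketch of what I tried: load with device_map='auto' and run a training step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")  # layers spread over the 2 GPUs
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("some training text", return_tensors="pt").to("cuda:0")  # inputs go to the first device
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # loss went down and weights changed across steps, hence my question
optimizer.step()
optimizer.zero_grad()
```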
Considering it's a tutorial for inference, there is no mention of training because training is not supported. It is solely for what its title says: big model inference.
We're looking into a method for supporting training; however, generally you'd want to use DeepSpeed or FSDP for this.
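As a sketch of what that looks like in practice (the model and dummy data here are placeholders): you write a plain Accelerate training loop, and when you launch it with an FSDP or DeepSpeed setup (e.g. `accelerate config` followed by `accelerate launch train.py`), `prepare()` wraps the model accordingly without changes to the training code itself.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy data just to keep the sketch self-contained
input_ids = torch.randint(0, model.config.vocab_size, (8, 32))
dataloader = DataLoader(TensorDataset(input_ids), batch_size=2)

# With an FSDP/DeepSpeed launch config, prepare() handles the sharding/wrapping
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for (batch,) in dataloader:
    outputs = model(input_ids=batch, labels=batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```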
Hi @muellerzr, thanks for the reply. I guess I had seen the note on training at the end of the "Run the model" section of the tutorial; it's just a short blurb warning against using the feature for training. But the point is well taken that the intent of the feature is inference, and that ZeRO or FSDP are the officially supported methods right now.
My follow-up question is: why wouldn't model parallelism across multiple GPUs work for training? I'm looking through the GPT2Model implementation specifically and saw there's a pair of functions being deprecated for parallelizing/deparallelizing the model. While it's probably not the same implementation as in from_pretrained(device_map='auto'), the idea seems to be the same: put different layers onto different devices. Also, in the forward function there's code for moving the intermediate tensors to the device of the distributed layers. This looks like the basic model parallelism paradigm in native torch, as suggested in this torch tutorial. It seems like backprop in torch should be able to handle this exact scenario of a model distributed across multiple GPUs. Am I missing something?
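Here's a minimal sketch of the naive model-parallel pattern I mean (toy layers rather than GPT2 blocks, assuming 2 GPUs), which autograd handles fine:

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(128, 128).to("cuda:0")  # first half on GPU 0
        self.block2 = nn.Linear(128, 10).to("cuda:1")   # second half on GPU 1

    def forward(self, x):
        x = torch.relu(self.block1(x.to("cuda:0")))
        # move the intermediate activation to the device of the next block,
        # just like the device moves in GPT2Model.forward
        return self.block2(x.to("cuda:1"))

model = TwoGPUNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 128)
target = torch.randint(0, 10, (16,), device="cuda:1")

loss = nn.functional.cross_entropy(model(x), target)
loss.backward()   # autograd follows the cross-device graph without special handling
optimizer.step()
```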
edit: I did some more reading through various docs/source files; is GPT2 specifically an exception that does support training with model parallelism? I saw a line in the multi-GPU training doc that says:
Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive MP support.
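If I'm reading the GPT2 source right, the "naive MP" it refers to is the deprecated parallelize()/deparallelize() API, roughly like this (the device_map maps GPU index to transformer block indices; the split below is just an example for the 12-layer gpt2 checkpoint):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.parallelize({0: list(range(0, 6)), 1: list(range(6, 12))})  # blocks 0-5 on GPU 0, 6-11 on GPU 1

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
batch = tokenizer("hello world", return_tensors="pt").to("cuda:0")  # inputs go to the first device

outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # gradients flow back across both GPUs

model.deparallelize()  # moves everything back when done
```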