Found the following statement:
You don’t need to prepare a model if it is used only for inference without any kind of mixed precision
— from the accelerate.Accelerator.prepare() documentation.
In data-parallel multi-GPU inference, we want a model copy to reside on each GPU. How can we achieve that without passing the model through Accelerator.prepare()?
You just move the model to the device. Check out the new distributed inference tutorial, and install accelerate from dev to make use of the new API if you want to use split_between_processes. Otherwise, pass your dataloader to Accelerator.prepare and do inference as usual.
Using the DDP wrapper on your model is only relevant when you want to update gradients (that’s what it’s designed for), so for inference just load the model on the device normally.
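A minimal sketch of that pattern (the model here is a stand-in nn.Linear for illustration; in practice you would load your own pretrained model):

```python
import torch
from torch import nn

# Stand-in model for illustration; substitute your own pretrained model.
model = nn.Linear(4, 2)

# For pure inference no DDP wrapper is needed: move the model to the
# device, switch to eval mode, and run forward passes under no_grad.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

batch = torch.randn(8, 4, device=device)
with torch.no_grad():
    outputs = model(batch)

print(tuple(outputs.shape))  # (8, 2)
```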
Thanks @muellerzr for your reply. Is there any benefit in using split_between_processes over passing a DataLoader to accelerate.Accelerator().prepare()?
It’s useful if you don’t want to make a DataLoader, or have things that can’t easily go in one (like the prompts in that example).
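To illustrate the idea, here is a hypothetical, simplified re-implementation of the slicing that split_between_processes performs (the real accelerate API is a context manager on Accelerator and also handles padding and nested structures; this toy function only shows the contiguous-chunk idea):

```python
import math

def split_between_processes(inputs, num_processes, process_index):
    """Toy sketch of how inputs could be divided across processes:
    each process takes a contiguous chunk, with earlier processes
    receiving any remainder. Not the actual accelerate implementation."""
    chunk = math.ceil(len(inputs) / num_processes)
    start = process_index * chunk
    return inputs[start:start + chunk]

prompts = ["p0", "p1", "p2", "p3", "p4"]
print(split_between_processes(prompts, 2, 0))  # ['p0', 'p1', 'p2']
print(split_between_processes(prompts, 2, 1))  # ['p3', 'p4']
```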
Understood, thanks @muellerzr !
Hi @muellerzr, is there a way I can do distributed inference using model sharding (FSDP)?
Is there a reason you want to do so instead of using device_map / big model inference? That would help narrow down my recommendation.
@muellerzr My model size is very close to the total GPU memory, and from what I understood in this article, I cannot run batches in parallel on all GPUs if I use device_map.
I was wondering if it’s possible to do inference FSDP-style, i.e. the model layers are sharded across all GPUs and exchanged on demand, so that I can process batches in parallel?
Is there an end to end example (in the Example Zoo) that I can refer to?
@sgugger @muellerzr Any thoughts on this?
Hello @varadhbhatnagar, you can use FSDP for distributed inference as long as you aren’t using the generate method, since FSDP is incompatible with generate (mentioned in the Fully Sharded Data Parallel docs (huggingface.co)).
For example, the Fully Sharded Data Parallel docs (huggingface.co) show how to set it up, and the nlp_example.py script also computes metrics on the eval set, which should mimic distributed inference.
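The evaluation pattern in nlp_example.py boils down to something like the following sketch, simplified to a single process with a stand-in classifier and synthetic data; in the distributed version, each batch’s predictions and labels would go through the accelerator’s gather step before the metric update:

```python
import torch
from torch import nn

# Stand-in classifier and synthetic eval batches for illustration.
torch.manual_seed(0)
model = nn.Linear(4, 3)
eval_batches = [
    (torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(3)
]

# Single-process version of the eval loop; a distributed run would
# gather predictions/labels from all processes before scoring.
model.eval()
correct = total = 0
with torch.no_grad():
    for inputs, labels in eval_batches:
        preds = model(inputs).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

accuracy = correct / total
```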