How to do distributed inference for large models with multiple processes?

I have a 20B-parameter model and 4 A100 GPUs with 40GB of memory each. I want to create two processes, each owning 2 GPUs, so that I can run inference in fp16. How can I do that with accelerate?

I have solved this problem by using DeepSpeed inference.
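For readers who land on this thread: below is a minimal sketch of what a DeepSpeed-inference setup for this scenario can look like. This is not @NobelHu's actual code; it assumes a Hugging Face causal LM checkpoint, uses `deepspeed.init_inference` to shard the model across the 2 GPUs owned by one launcher instance, and the model name `my-20b-model` is a placeholder.

```python
# Hedged sketch of fp16 inference with DeepSpeed tensor parallelism (mp_size=2).
# Not the original poster's code; "my-20b-model" is a placeholder checkpoint.
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))  # set by the deepspeed launcher

model_name = "my-20b-model"  # placeholder: substitute your 20B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model across the 2 GPUs assigned to this launcher instance.
engine = deepspeed.init_inference(
    model,
    mp_size=2,          # tensor-parallel degree: 2 GPUs per model replica
    dtype=torch.half,   # fp16 inference
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused inference kernels
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
if local_rank == 0:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

To get the two-replica layout from the question (2 processes × 2 GPUs each), you would launch the script twice, pinning each launch to its own pair of GPUs and giving each its own rendezvous port, e.g. `deepspeed --include localhost:0,1 --master_port 29500 run_inference.py` and `deepspeed --include localhost:2,3 --master_port 29501 run_inference.py` (the script name and ports here are illustrative).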


Hi @NobelHu, glad that you managed to make it work! Would you mind sharing your solution with the community? We are also thinking about linking a DeepSpeed inference example in our docs so that everyone benefits from it.

Also want to know!