How to do distributed inference for large models with multiple processes?

I have a 20B-parameter model and 4 A100 GPUs with 40 GB of memory each. I want to create two processes, each owning 2 GPUs, so that I can run inference in fp16. How can I do that with accelerate?

I have solved this problem by using DeepSpeed inference.
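Since the actual solution code isn't included in the post, here is a minimal sketch of what a DeepSpeed-Inference setup along these lines might look like, assuming a Hugging Face causal LM. The model name, prompt, and generation parameters are placeholders, not the original poster's code; it shards one model replica across 2 GPUs with tensor parallelism via `deepspeed.init_inference`.

```python
# Minimal sketch of DeepSpeed-Inference with 2-way tensor parallelism.
# Launch with: deepspeed --num_gpus 2 run_inference.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"  # placeholder; swap in the 20B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model across 2 GPUs and run in fp16, so a 20B model
# (~40 GB of weights in fp16) fits across two 40 GB A100s.
model = deepspeed.init_inference(
    model,
    mp_size=2,                       # tensor-parallel degree: 2 GPUs per replica
    dtype=torch.float16,
    replace_with_kernel_inject=True, # use DeepSpeed's fused inference kernels
)

inputs = tokenizer("DeepSpeed inference makes large models", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the setup in the question (two replicas on 4 GPUs), one would launch across all four GPUs with `deepspeed --num_gpus 4` while keeping `mp_size=2`, so that DeepSpeed forms two tensor-parallel groups of 2 GPUs each; whether this matches the original poster's exact configuration is an assumption.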


Hi @NobelHu, glad you managed to make it work! Would you mind sharing your solution with the community? We are also thinking about linking a DeepSpeed inference example in our docs so that everyone can benefit from it.