Distributed Inference on GPT-2

I’m using the following to run distributed inference on the GPT-2 model:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("model-name", return_dict=True, device_map="auto", low_cpu_mem_usage=True, torch_dtype=torch.float16)

However, when I run nvidia-smi I see that not all GPUs are being utilized to their full capacity. Is there any way to speed up the distributed inference? I’m probably missing something crucial here and would love any thoughts!

GPT-2 is quite small, so it may not be large enough to keep all of your GPUs busy. The output of nvidia-smi would be handy.
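
It can also help to look at how the layers were actually sharded. With a model loaded via device_map="auto", something like this should print the placement (hf_device_map is the attribute transformers/accelerate attach to the dispatched model, as far as I know):

```python
# Print which device each module ended up on after device_map="auto".
# Keys are module names, values are GPU indices (or "cpu"/"disk" if offloaded).
for module_name, device in model.hf_device_map.items():
    print(device, module_name)
```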

In general though, you can try maximizing your batch size as much as possible.
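
A rough sketch of what batched generation could look like (the model name, prompts, and sampling arguments here are placeholders rather than your exact setup; you'd grow the batch size until just before OOM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
# GPT-2 has no pad token, so reuse EOS and left-pad for batched generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    "gpt2-xl", device_map="auto", torch_dtype=torch.float16
)

# One batch of prompts; increase the batch size until just under OOM.
prompts = ["Once upon a time"] * 200

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Left padding matters here because GPT-2 is decoder-only, so generation continues from the right-most token of each prompt.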


Here’s the nvidia-smi output 🙂 I’m running inference on the XL model with a batch size of 200 (anything more than that throws OOM). I’m generating 25k samples and it takes around an hour, which is not bad, but I was curious whether there are any other ways to decrease the inference time.