I’m using this to do distributed inference with the GPT-2 model:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("model-name", return_dict=True, device_map="auto", low_cpu_mem_usage=True, torch_dtype=torch.float16)
However, when I run nvidia-smi I see that not all GPUs are being utilized to their maximum capacity. Is there any way to speed up the distributed inference? I’m probably missing something crucial here and would love any thoughts!
GPT-2 is quite small, so it may not be large enough to keep all of your GPUs busy. The output of nvidia-smi would be handy.
In general, though, you can try maximizing your batch size as much as possible; a rough sketch of batched generation is below.
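Something along these lines is what I have in mind (a minimal sketch: the "model-name" checkpoint, the prompt list, the batch size of 200, and the generation settings are placeholders you’d swap for your own):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-name")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
tokenizer.padding_side = "left"             # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained("model-name", device_map="auto", low_cpu_mem_usage=True, torch_dtype=torch.float16)

prompts = ["Once upon a time"] * 200        # hypothetical batch of prompts
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)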
Here’s the nvidia-smi output. I’m running inference on the XL model with a batch size of 200 (anything higher throws an OOM error). I’m generating 25k samples and it takes around an hour, which isn’t bad, but I was curious whether there are any other ways to decrease the inference time.