Hi there!
I'm currently building an API for document summarization, using FastAPI as the backbone and HuggingFace transformers for inference. The idea for now is pretty simple: send a document to an endpoint, and a summary comes back.
The host this will run on has 8 x H100 GPUs (80 GB of VRAM apiece), and I'd like to start out with the Llama 3.1 8B Instruct model. I'm sure this will change over time, but that's what I'm starting with for now.
My goal is to have the same model loaded on each of the GPUs, with each GPU doing inference independently when a request comes in. That is, if a single request comes in for a document summary, one and only one GPU runs inference on it and returns the result. If 3 requests come in at the same time, then 3 GPUs are used independently, without any work sharing. If more than 8 come in, or the pool of GPUs is otherwise exhausted, the API just returns a 500-level error. Of course this requires a little bit of locking on the backend side, which I have working correctly.
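For context, here's a stripped-down sketch of how the per-GPU locking and dispatch works on my side (not my exact code; the endpoint path, the `PIPELINES` dict, `NUM_GPUS`, and the 503 status are placeholders):

```python
import threading

import torch
import transformers
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

NUM_GPUS = 8
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# One pipeline per GPU, created once at startup.
PIPELINES = {
    i: transformers.pipeline(
        "text-generation",
        model=MODEL_ID,
        device=f"cuda:{i}",
        model_kwargs={"torch_dtype": torch.bfloat16},
    )
    for i in range(NUM_GPUS)
}

# Pool of free GPU indices, guarded by a lock.
FREE_GPUS = set(range(NUM_GPUS))
POOL_LOCK = threading.Lock()

app = FastAPI()


class Doc(BaseModel):
    text: str


@app.post("/summarize")
def summarize(doc: Doc):
    # Claim a free GPU, or fail fast if the pool is exhausted.
    with POOL_LOCK:
        if not FREE_GPUS:
            raise HTTPException(status_code=503, detail="All GPUs are busy")
        gpu_id = FREE_GPUS.pop()
    try:
        pipe = PIPELINES[gpu_id]
        out = pipe(
            f"Summarize the following document:\n\n{doc.text}",
            max_new_tokens=256,
        )
        return {"gpu": gpu_id, "summary": out[0]["generated_text"]}
    finally:
        # Return the GPU to the pool once the request finishes.
        with POOL_LOCK:
            FREE_GPUS.add(gpu_id)
```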
I have a separate script I wrote that runs 8 worker threads, each of which does nothing more than grab a document from disk, send it to the API using requests, and print the result. So with 8 worker threads it submits 8 documents as fast as possible, where each thread is "load document from disk, send HTTP request with the document in the body, wait for the summarization response, then move on to the next file".
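The client is roughly this (simplified sketch; the URL, directory, and JSON shape are placeholders):

```python
import pathlib
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/summarize"  # placeholder endpoint
DOC_DIR = pathlib.Path("documents")          # placeholder directory of input files
NUM_WORKERS = 8


def summarize_file(path: pathlib.Path) -> str:
    # Load the document, POST it, and block until the summary comes back.
    text = path.read_text()
    resp = requests.post(API_URL, json={"text": text}, timeout=600)
    resp.raise_for_status()
    return resp.json()["summary"]


if __name__ == "__main__":
    files = sorted(DOC_DIR.glob("*.txt"))
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        for summary in pool.map(summarize_file, files):
            print(summary)
```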
However, things seem to go sideways as I increase the number of concurrent requests. When I send just one request at a time, one GPU gets fully utilized, as you'd expect. But if I send 4 requests in parallel, each of the 4 GPUs runs at reduced core utilization.
It can roughly be summed up as the following:
1 request at a time = 75% utilization of a single GPU
2 requests at a time = 45% utilization of 2 GPUs
3 requests at a time = 35% utilization of 3 GPUs
4 requests at a time = 25% utilization of 4 GPUs
…
8 requests at a time = ~5% utilization of 8 GPUs
What I'm trying to achieve is that each request is processed independently on a single GPU, without any work sharing.
Any idea where I can look? Most of the material I've come across about multi-GPU setups is about training.
When I load the models, I am loading them with:

```python
import torch
import transformers

# One pipeline per GPU; model_config.device is "cuda:0", "cuda:1", etc.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_config.path,
    device=model_config.device,
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```
(assume that `model_config.path` is a HuggingFace model ID, and `model_config.device` looks like "cuda:0", "cuda:1", "cuda:2", …, "cuda:N")
When the API boots up and all the models get loaded, `nvtop` shows that the models are in fact loaded as expected, with memory usage spread equally across all 8 GPUs (one copy per GPU). However, when it comes time to run inference, I feel like the data is getting scattered, or the inputs get shuffled around.
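If it helps, device placement can be inspected after loading with something like this (sketch, reusing the `PIPELINES` dict from the earlier sketch):

```python
# Print where each pipeline's weights actually live after loading.
for gpu_id, pipe in PIPELINES.items():
    weight_device = next(pipe.model.parameters()).device
    print(f"pipeline {gpu_id}: pipe.device={pipe.device}, weights on {weight_device}")
```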
Local HuggingFace info:
- transformers version: 4.44.1
- Platform: Linux-6.8.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.12.5
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.4
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: I don't think so.
- Using GPU in script?: I believe so; using "cuda:0", "cuda:1", etc.
- GPU type: NVIDIA H100 80GB HBM3 (8 of them)
Any input appreciated!