Hi there!
I'm currently building an API for document summarization, using FastAPI as the backbone and HuggingFace transformers for inference. The idea for now is pretty simple: send a document to an endpoint, and a summary comes back.
The host this will run on has 8 x H100 GPUs (80 GB of VRAM apiece), and I'd like to start out with the Llama 3.1 8B Instruct model. I'm sure this will change over time, but that's what I'm starting with for now.
My goal is to have the same model loaded on each of the GPUs, with each GPU doing inference independently when a request comes in. That is, if a single request comes in for a document summary, one and only one GPU runs inference on it and returns the result. If 3 requests come in at the same time, then 3 GPUs are used independently, without any work sharing. If more than 8 come in, or the pool of GPUs is otherwise exhausted, the API just returns a 500-level error. Of course this requires a little bit of locking on the backend side, which I have working correctly.
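For context, here's a stripped-down sketch of how the per-GPU locking and dispatch works on my side (not my exact code; the endpoint path, the `PIPELINES` dict, `NUM_GPUS`, and the 503 status are placeholders):

```python
import threading

import torch
import transformers
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

NUM_GPUS = 8
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# One pipeline per GPU, created once at startup.
PIPELINES = {
    i: transformers.pipeline(
        "text-generation",
        model=MODEL_ID,
        device=f"cuda:{i}",
        model_kwargs={"torch_dtype": torch.bfloat16},
    )
    for i in range(NUM_GPUS)
}

# Pool of free GPU indices, guarded by a lock.
FREE_GPUS = set(range(NUM_GPUS))
POOL_LOCK = threading.Lock()

app = FastAPI()


class Doc(BaseModel):
    text: str


@app.post("/summarize")
def summarize(doc: Doc):
    # Claim a free GPU, or fail fast if the pool is exhausted.
    with POOL_LOCK:
        if not FREE_GPUS:
            raise HTTPException(status_code=503, detail="All GPUs are busy")
        gpu_id = FREE_GPUS.pop()
    try:
        pipe = PIPELINES[gpu_id]
        out = pipe(
            f"Summarize the following document:\n\n{doc.text}",
            max_new_tokens=256,
        )
        return {"gpu": gpu_id, "summary": out[0]["generated_text"]}
    finally:
        # Return the GPU to the pool once the request finishes.
        with POOL_LOCK:
            FREE_GPUS.add(gpu_id)
```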
I have a separate script I wrote that runs 8 worker threads, each of which does nothing more than grab a document from disk, send it to the API using requests, and print the result. So with 8 worker threads it submits 8 documents as fast as possible, where each thread is "load document from disk, send HTTP request with the document in the body, wait for the summarization response, then move on to the next file".
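The client is roughly this (simplified sketch; the URL, directory, and JSON shape are placeholders):

```python
import pathlib
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/summarize"  # placeholder endpoint
DOC_DIR = pathlib.Path("documents")          # placeholder directory of input files
NUM_WORKERS = 8


def summarize_file(path: pathlib.Path) -> str:
    # Load the document, POST it, and block until the summary comes back.
    text = path.read_text()
    resp = requests.post(API_URL, json={"text": text}, timeout=600)
    resp.raise_for_status()
    return resp.json()["summary"]


if __name__ == "__main__":
    files = sorted(DOC_DIR.glob("*.txt"))
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        for summary in pool.map(summarize_file, files):
            print(summary)
```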
However, things seem to go sideways as I increase the number of concurrent requests. When I send just one request at a time, one GPU gets fully utilized, as you'd expect. But if I send 4 requests in parallel, each of the 4 GPUs runs at reduced core utilization.
It can roughly be summed up as the following:
1 request at a time = 75% utilization of a single GPU
2 requests at a time = 45% utilization of 2 GPUs
3 requests at a time = 35% utilization of 3 GPUs
4 requests at a time = 25% utilization of 4 GPUs
…
8 requests at a time = ~5% utilization of 8 GPUs
What I'm trying to achieve is that each request is processed independently on a single GPU, without any work sharing.
Any idea where I can look? Most of the material I've come across about multi-GPU setups is about training.
When I load the models, I am loading them with:

```python
import torch
import transformers

# One pipeline per GPU; model_config.device is "cuda:0", "cuda:1", etc.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_config.path,
    device=model_config.device,
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```
(assume that `model_config.path` is a HuggingFace model ID, and `model_config.device` looks like "cuda:0", "cuda:1", "cuda:2", …, "cuda:N")
When the API boots up and all the models get loaded, `nvtop` shows that the models are in fact loaded as expected, with memory usage spread equally across all 8 GPUs (one copy per GPU). However, when it comes time to run inference, I feel like the data is getting scattered, or the inputs get shuffled around.
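If it helps, device placement can be inspected after loading with something like this (sketch, reusing the `PIPELINES` dict from the earlier sketch):

```python
# Print where each pipeline's weights actually live after loading.
for gpu_id, pipe in PIPELINES.items():
    weight_device = next(pipe.model.parameters()).device
    print(f"pipeline {gpu_id}: pipe.device={pipe.device}, weights on {weight_device}")
```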
Local HuggingFace info:
- transformers version: 4.44.1
- Platform: Linux-6.8.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.12.5
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.4
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: I don't think so.
- Using GPU in script?: I believe so; using "cuda:0", "cuda:1", etc.
- GPU type: NVIDIA H100 80GB HBM3 (8 of them)
Any input appreciated!