How can we maximize GPU utilization in Inference Endpoints?

First, I deployed a BlenderBot model without any customization. Then I added a handler.py file with the code below so that inference goes through model.generate() rather than pipeline(), which I assumed would make better use of the GPU's capacity:

import torch
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("device=", device)

model_name = "facebook/blenderbot-400M-distill"
class EndpointHandler():
    def __init__(self, path=""):
        # Load the model and tokenizer once at startup and move the model to the GPU.
        # Note: device_map="auto" and load_in_8bit are model kwargs, not tokenizer kwargs.
        self.tokenizer = BlenderbotTokenizer.from_pretrained(model_name)
        self.model = BlenderbotForConditionalGeneration.from_pretrained(model_name).to(device)
        self.model.eval()

    def __call__(self, data):
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", None) or {}

        inputs = self.tokenizer([inputs], return_tensors="pt").to(device)
        with torch.inference_mode():
            # Forward any generation parameters sent with the request.
            reply_ids = self.model.generate(**inputs, **parameters)
        prediction = self.tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0]

        return prediction
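For a quick local sanity check, you can call the handler directly; the payload below is hypothetical but mirrors the shape Inference Endpoints passes to __call__:

# Hypothetical local test of the handler, not part of the deployed endpoint.
handler = EndpointHandler()
print(handler({"inputs": "Hello, how are you today?"}))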

It works and uses the GPU without any issue, but it never exceeds 50% GPU utilization, regardless of how many requests I send to the instance. As a result, autoscaling never kicks in (GPU utilization must stay above 80% for more than 2 minutes to trigger it). I sent many concurrent requests with load-testing tools like ab, hey, and ali, but none of them pushed the GPU past 50% utilization.
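For reference, here is a minimal Python sketch of the kind of concurrent load I generated (equivalent to what ab/hey/ali do); the endpoint URL and token are placeholders, not real values:

import concurrent.futures
import requests

# Placeholders; substitute your own endpoint URL and HF token.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer YOUR_HF_TOKEN", "Content-Type": "application/json"}

def send_request(i):
    resp = requests.post(ENDPOINT_URL, headers=HEADERS,
                         json={"inputs": "Hello, how are you today?"})
    return resp.status_code

# Fire 200 requests with 50 concurrent workers.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    statuses = list(pool.map(send_request, range(200)))

print(statuses.count(200), "successful responses")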

I’m wondering how to maximize GPU utilization while using a generative language model like BlenderBot.

I discovered that the issue lies with the model itself: facebook/blenderbot-400M-distill is simply too small to saturate the GPU's capacity. To address this, I switched to a larger model, facebook/blenderbot-3B, which pushed GPU utilization above 90% under heavy load, so the second replica spun up after 2 minutes.
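The only change needed in handler.py was the model name; note that the 3B model requires considerably more GPU memory than the 400M-distill one, so your instance needs enough headroom:

# Swapping in the larger model was the only change; the rest of the handler is unchanged.
model_name = "facebook/blenderbot-3B"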