First, I deployed a BlenderBot model without any customization. Then I added a handler.py file containing the code below to make sure it uses model.generate() rather than pipeline() (which I assumed would make better use of the GPU's capacity):
```python
import torch
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("device=", device)
model_name = "facebook/blenderbot-400M-distill"

class EndpointHandler():
    def __init__(self, path=""):
        # Load the model once and move it to the GPU here, rather than on
        # every request. (Note: device_map="auto" and load_in_8bit=True are
        # model-loading options, not tokenizer options, so they are omitted
        # from the tokenizer call.)
        self.model = BlenderbotForConditionalGeneration.from_pretrained(model_name).to(device)
        self.tokenizer = BlenderbotTokenizer.from_pretrained(model_name)

    def __call__(self, data):
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", None)
        inputs = self.tokenizer([inputs], return_tensors="pt").to(device)
        reply_ids = self.model.generate(**inputs)
        prediction = self.tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0]
        return prediction
```
It works and utilizes the GPU without any issue, but the problem is that GPU usage never exceeds 50%, regardless of the number of requests sent to the instance. As a result, autoscaling never triggers (GPU utilization must exceed 80% for more than 2 minutes to trigger autoscaling). I sent many concurrent requests using load-testing tools such as `ab`, `hey`, and `ali`, but none of them pushed the GPU past 50% utilization.
I’m wondering how to maximize GPU utilization while using a generative language model like BlenderBot.
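One likely cause is that the handler runs a separate `generate()` call per request, so the GPU never sees enough work at once. A common remedy (not from the post above, just a sketch) is to micro-batch concurrent requests before calling the model. The `MicroBatcher` class and `batch_fn` below are hypothetical; the model is stubbed out so the batching logic itself is visible:

```python
import threading
import queue
import time

class MicroBatcher:
    """Collect concurrent requests into small batches so each generate()
    call processes several inputs at once, raising GPU utilization."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self.batch_fn = batch_fn            # e.g. wraps tokenizer + model.generate
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s        # how long to wait for more requests
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, text):
        """Called from each request thread; blocks until the reply is ready."""
        slot = {"input": text, "done": threading.Event(), "output": None}
        self.requests.put(slot)
        slot["done"].wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]   # block until at least one request
            deadline = time.monotonic() + self.max_wait_s
            # Gather more requests until the batch is full or the deadline hits.
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=timeout))
                except queue.Empty:
                    break
            outputs = self.batch_fn([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()
```

In a real handler, `batch_fn` would tokenize the list of texts with `padding=True`, run a single `model.generate(**inputs)` on the GPU, and `batch_decode` the results; whether this actually pushes utilization past 80% depends on batch size and sequence lengths.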