Hi! I’m working with an ML model produced by a researcher. I’m trying to set it up to run economically in production on large volumes of text. I know a lot about production engineering, but next to nothing about ML. I’m getting some results that are surprising to me, and I’m hoping for pointers to explanations and advice.
Right now I have the key code broken out into 3 methods, so it’s easy to profile.
def _tokenize(self, text):
    # Tokenize one text and move the encoded tensors to the target device (CPU or GPU)
    return self.tokenizer.encode_plus(text, add_special_tokens=True, return_tensors='pt').to(self.device)

def _run_model(self, model_input):
    # Forward pass; the model returns a tuple, and element [0] is the logits tensor
    return self.model(model_input['input_ids'], token_type_ids=model_input['token_type_ids'])[0]

def _extract_results(self, logits):
    # Pull the two class logits back out as Python floats
    return logits[0][0].item(), logits[0][1].item()
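In case the call pattern matters, this is roughly how the three methods get chained and timed per item. It's a simplified sketch, not the real harness; the classify name and the timings dict are just placeholders for illustration:

import time

def classify(self, text):
    # Sketch of the per-item driver: run the three steps in sequence and
    # record wall-clock time for each one with time.perf_counter().
    timings = {}

    start = time.perf_counter()
    model_input = self._tokenize(text)
    timings['_tokenize'] = time.perf_counter() - start

    start = time.perf_counter()
    logits = self._run_model(model_input)
    timings['_run_model'] = time.perf_counter() - start

    start = time.perf_counter()
    results = self._extract_results(logits)
    timings['_extract_results'] = time.perf_counter() - start

    return results, timings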
If I run this using my laptop CPU, I get numbers that make sense to me. For 1459 items, those three methods take 196.4 seconds, or about 135 ms per item. About 2.9 seconds is _tokenize, and the rest is _run_model.
When I switch over to my laptop GPU, I get numbers that mystify me. The same data takes 131.8 seconds: 2.5 seconds to tokenize, 20.3 seconds to run the model, and 108.8 seconds to extract the results!
The _extract_results method costs the same whether I extract one logit or both. The first one I extract is slow, whether that's logits[0][0] or logits[0][1]. The second one is effectively free.
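To be concrete, by "one logit" I mean a variant like this (hypothetical name, just to show what I compared; the two-logit version is the _extract_results above):

def _extract_one(self, logits):
    # Returning a single logit: this lone .item() call accounts for essentially
    # all of the 108.8 seconds, whether it's logits[0][0] or logits[0][1]
    return logits[0][1].item()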
From nvidia-smi, I can see that the GPU is really being used, and my process is using ~850 MB of the 2 GB of GPU RAM. So that seems fine. And if it matters, this is a GeForce 940MX on a ThinkPad T470.
Do these numbers make sense to more experienced hands? I was expecting the GPU runs to be much faster, but if I actually want to get the results, it’s only a little faster.
Thanks!