We are using DeBERTa for sentence embedding. The latency for embedding a single sentence is about 30 ms, and when I switch to batch inference, the latency increases linearly with the batch size. Is this expected? We are running on an A10 GPU with Triton Inference Server.
for batch in dataloader:
    tokens = self._tokenizer(
        batch,
        truncation=True,       # truncate to max_length
        padding=True,          # pad to the longest sequence in the batch
        return_tensors="np",
        max_length=self._max_len,
    )
    # Running inference on tokens
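For reference, here is a minimal sketch I would use to check how latency scales with batch size outside of Triton, to see whether the model itself scales linearly. It assumes a local PyTorch run via the Hugging Face transformers library; the checkpoint name "microsoft/deberta-base", the max length of 128, and the iteration counts are placeholders, not our actual setup.

import time

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/deberta-base"  # assumption: replace with your checkpoint
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()

sentence = "A representative sentence of typical length for the workload."

for batch_size in (1, 2, 4, 8, 16, 32):
    batch = [sentence] * batch_size
    tokens = tokenizer(
        batch,
        truncation=True,
        padding=True,
        return_tensors="pt",
        max_length=128,  # assumption: use your real max_length
    ).to(device)

    with torch.no_grad():
        # Warm up, then time several iterations; synchronize so we
        # measure actual GPU work, not just kernel launches.
        for _ in range(3):
            model(**tokens)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            model(**tokens)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10

    print(f"batch={batch_size:3d}  latency={elapsed * 1000:7.2f} ms  "
          f"per-sentence={elapsed * 1000 / batch_size:6.2f} ms")

Comparing the per-sentence column across batch sizes would show whether batching helps at all on the bare model, which would help tell a model/GPU bottleneck apart from something in the Triton setup.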