We are using DeBERTa for sentence embeddings. The latency for embedding a single sentence is about 30 ms, but when I switch to batch inference, latency increases roughly linearly with batch size. Is this expected? We are running on an A10 GPU with Triton Inference Server.
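To illustrate how I measure this, here is a simplified, standalone sketch that times one forward pass per batch size against a local PyTorch copy of the model rather than our Triton deployment; the model name, max length, and test sentence are placeholders, not our exact setup:

import time

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/deberta-v3-base"  # placeholder, not our exact model
device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()

sentence = "A short example sentence for benchmarking."

for batch_size in (1, 2, 4, 8, 16, 32):
    batch = [sentence] * batch_size
    tokens = tokenizer(
        batch,
        truncation=True,
        padding=True,
        return_tensors="pt",
        max_length=128,  # placeholder max length
    ).to(device)

    with torch.no_grad():
        # Warm-up pass so CUDA kernels and allocations are initialized.
        model(**tokens)
        torch.cuda.synchronize()

        start = time.perf_counter()
        model(**tokens)
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000

    print(f"batch_size={batch_size:3d}  latency={elapsed_ms:.1f} ms")

In the actual pipeline we tokenize each batch like this before sending it to the server: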
for batch in dataloader:
    tokens = self._tokenizer(
        batch,
        truncation=True,
        padding=True,
        return_tensors="np",
        max_length=self._max_len,
    )
    # Running inference on tokens
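The inference step behind that last comment goes through the Triton HTTP client, roughly like the sketch below. This is illustrative only: the model name "deberta_embedder" and the tensor names "input_ids", "attention_mask", and "embedding" are hypothetical placeholders that depend on the model's config.pbtxt, not our exact configuration.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def run_inference(tokens):
    """Send one tokenized batch to Triton and return the embeddings."""
    # Tensor names and dtypes below are assumptions based on a typical config.
    input_ids = tokens["input_ids"].astype(np.int64)
    attention_mask = tokens["attention_mask"].astype(np.int64)

    inputs = [
        httpclient.InferInput("input_ids", input_ids.shape, "INT64"),
        httpclient.InferInput("attention_mask", attention_mask.shape, "INT64"),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(attention_mask)

    outputs = [httpclient.InferRequestedOutput("embedding")]
    result = client.infer(
        model_name="deberta_embedder", inputs=inputs, outputs=outputs
    )
    return result.as_numpy("embedding")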