We are using DeBERTa for sentence embedding. The latency for embedding a single sentence is about 30 ms, and when I switch to batch inference, the latency increases linearly with the batch size. Is this expected? We are running on an A10 GPU with Triton Inference Server.
for batch in dataloader:
    tokens = self._tokenizer(
        batch,
        truncation=True,       # truncate to max_length
        padding=True,          # pad to the longest sequence in the batch
        return_tensors="np",
        max_length=self._max_len,
    )
    # Running inference on tokens
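For reference, here is a minimal sketch I would use to check how latency scales with batch size outside of Triton, to see whether the model itself scales linearly. It assumes a local PyTorch run via the Hugging Face transformers library; the checkpoint name "microsoft/deberta-base", the max length of 128, and the iteration counts are placeholders, not our actual setup.

import time

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/deberta-base"  # assumption: replace with your checkpoint
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()

sentence = "A representative sentence of typical length for the workload."

for batch_size in (1, 2, 4, 8, 16, 32):
    batch = [sentence] * batch_size
    tokens = tokenizer(
        batch,
        truncation=True,
        padding=True,
        return_tensors="pt",
        max_length=128,  # assumption: use your real max_length
    ).to(device)

    with torch.no_grad():
        # Warm up, then time several iterations; synchronize so we
        # measure actual GPU work, not just kernel launches.
        for _ in range(3):
            model(**tokens)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            model(**tokens)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10

    print(f"batch={batch_size:3d}  latency={elapsed * 1000:7.2f} ms  "
          f"per-sentence={elapsed * 1000 / batch_size:6.2f} ms")

Comparing the per-sentence column across batch sizes would show whether batching helps at all on the bare model, which would help tell a model/GPU bottleneck apart from something in the Triton setup.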