I observe a linear relationship between response time and the number of input sentences when using the Accelerated Inference API, even when the number of inputs is small enough that they should all fit in a single batch.
With my own model, I observe a response time of about 3 seconds when sending a single sentence with the “inputs” parameter, and about 12 seconds when sending 4 sentences simultaneously.
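
For reference, this is roughly how I'm measuring it (the model ID and token below are placeholders, not my actual values):

```python
import time
import requests

# Placeholder model ID and API token -- substitute your own.
API_URL = "https://api-inference.huggingface.co/models/<your-model-id>"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

def query(sentences):
    """Send a list of sentences via the 'inputs' field and time the call."""
    start = time.time()
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": sentences})
    elapsed = time.time() - start
    return response.json(), elapsed

# One sentence vs. four sentences in a single request.
_, t1 = query(["This is a single test sentence."])
_, t4 = query(["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."])
print(f"1 sentence: {t1:.1f}s, 4 sentences: {t4:.1f}s")  # roughly 3s vs 12s in my case
```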
This leads me to believe that there is no batching at all at inference time, and sentences are processed sequentially.
Is that correct? Can I somehow enforce a batch size of N?