I observe a linear relationship between response time and the number of input sentences when using the Accelerated Inference API, even when the number of inputs is small enough that they should all fit in a single batch.
With my own model, I observe a response time of about 3 seconds when sending a single sentence with the “inputs” parameter, and about 12 seconds when sending 4 sentences simultaneously.
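
For reference, this is roughly how I'm measuring it (the model ID and token below are placeholders, not my actual values):

```python
import time
import requests

# Placeholder model ID and API token -- substitute your own.
API_URL = "https://api-inference.huggingface.co/models/<your-model-id>"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

def query(sentences):
    """Send a list of sentences via the 'inputs' field and time the call."""
    start = time.time()
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": sentences})
    elapsed = time.time() - start
    return response.json(), elapsed

# One sentence vs. four sentences in a single request.
_, t1 = query(["This is a single test sentence."])
_, t4 = query(["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."])
print(f"1 sentence: {t1:.1f}s, 4 sentences: {t4:.1f}s")  # roughly 3s vs 12s in my case
```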
This leads me to believe that there is no batching at all at inference time, and sentences are processed sequentially.
Is that correct? Can I somehow enforce a batch size of N?