Batching in SageMaker Inference Toolkit

Thanks for putting together this great toolkit. I had a question about how inference batching is handled. I noticed that the examples here all appear to use a single input request. Once a model is deployed, if multiple requests hit the endpoint at once or in quick succession, are they automatically batched under the hood, or do you need to do something before hitting the endpoint to feed in a batch of inputs manually?


Pinging @philschmid and @jeffboudier in case they haven't seen this!

Hello @charlesatftl,

Thank you for the nice feedback.
Since the Inference Toolkit is built on top of the transformers pipelines, it currently handles batching the same way the pipelines do. A few pipelines support batching because it is faster for them, e.g. text-classification or zero-shot-classification; other pipelines, like question-answering, do not support batching and run sequential predictions instead.
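As a minimal local sketch (not SageMaker-specific), you can already pass a list of inputs directly to a transformers pipeline; whether it batches them or loops over them internally is up to that pipeline's implementation:

```python
from transformers import pipeline

# Uses the task's default model; swap in your own model id if you prefer.
classifier = pipeline("text-classification")

# Passing a list of inputs; the pipeline decides whether to batch or iterate.
results = classifier(["sentence 1", "sentence 2"])
print(results)
```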

But you can send a request with multiple inputs to any pipeline through the Inference Toolkit, e.g.

{"inputs": ["sentence 1","sentence 2"]}

The pipeline then either batches the inputs or runs them sequentially.
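For a deployed endpoint, sending such a multi-input payload could look like the following sketch, which uses the SageMaker runtime API via boto3 (the endpoint name here is a placeholder, not something from this thread):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Multiple inputs in a single request, matching the payload shape above.
payload = {"inputs": ["sentence 1", "sentence 2"]}

response = runtime.invoke_endpoint(
    EndpointName="my-huggingface-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```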

Dynamic batching across separate requests is currently not supported, but you could create a custom inference.py and implement it yourself.
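As a rough sketch of that direction, assuming the toolkit's model_fn/predict_fn override hooks and a standard text-classification checkpoint in the model directory, an inference.py could at least batch all inputs of one request into a single padded forward pass (true cross-request dynamic batching would additionally need request queuing logic on top of this):

```python
# inference.py -- hedged sketch of a custom handler that batches the inputs of
# a single request manually instead of relying on the pipeline's behaviour.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def model_fn(model_dir):
    # Load tokenizer and model from the unpacked model artifact.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = data["inputs"]
    if isinstance(inputs, str):
        inputs = [inputs]

    # Tokenize the whole list as one padded batch and run a single forward pass.
    encoded = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoded).logits

    predictions = logits.argmax(dim=-1).tolist()
    return [{"label": model.config.id2label[p]} for p in predictions]
```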

P.S. Batching in NLP is not as efficient as in CV, for example, since all sequences in a batch need to be padded to the same length, which can end up slower than doing sequential predictions.
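A tiny illustration of that padding cost (the model id is just an example): shorter sequences get padded up to the longest sequence in the batch, so the batch pays for tokens that carry no content.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # example model

batch = ["short sentence", "a much longer sentence that forces padding of the short one"]
encoded = tokenizer(batch, padding=True, return_tensors="pt")

# Both rows now have the same length; the attention_mask shows where padding was added.
print(encoded["input_ids"].shape)
print(encoded["attention_mask"])
```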