How to configure GPU server-side batching with SageMaker HF Hosting?
I want to process multiple inferences in a single batched GPU call (instead of a CPU-GPU round trip for every request). MMS supports this in its open-source flavor; how can we use it with SageMaker Hugging Face hosting?
I'm not sure how the MMS side of things works; maybe you can ask at Issues · awslabs/multi-model-server · GitHub.
But once MMS is configured, it depends on the task you are using: you might need to create some custom logic in an inference.py, overriding the input_fn or output_fn.
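For reference, in open-source MMS, dynamic batching is a per-model setting applied at registration time through the management API. This is a sketch of that open-source workflow; the hostname, port, and model archive name are placeholders, and the SageMaker container may expose this configuration differently:

```shell
# Register a model with server-side batching in open-source MMS.
# Requests arriving within max_batch_delay (ms) are grouped into a
# single batch of up to batch_size before being passed to the handler.
curl -X POST "http://localhost:8081/models?url=my_model.mar&batch_size=8&max_batch_delay=100"
```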
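To illustrate the inference.py route: the SageMaker Hugging Face Inference Toolkit lets you override the default handler functions by placing an inference.py in your model archive's code/ directory. Below is a minimal sketch, assuming JSON requests with an "inputs" field (the request shape and the model returned by the default model_fn are assumptions); the batching itself still happens on the MMS side, but these functions make sure a batched payload flows through as a list:

```python
# code/inference.py - minimal sketch of custom handler functions for the
# SageMaker Hugging Face Inference Toolkit. If these functions are defined,
# the toolkit calls them instead of its defaults.
import json


def input_fn(request_body, content_type):
    # Deserialize a (possibly batched) JSON request into a list of inputs.
    if content_type == "application/json":
        data = json.loads(request_body)
        inputs = data["inputs"]
        # Normalize a single example into a batch of one, so predict_fn
        # always receives a list.
        return inputs if isinstance(inputs, list) else [inputs]
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(inputs, model):
    # `model` is whatever model_fn returned (by default a Hugging Face
    # pipeline), which accepts a list and runs it as one batched call.
    return model(inputs)


def output_fn(predictions, accept):
    # Serialize the list of predictions back to JSON.
    return json.dumps(predictions)
```

With this in place, a client can send `{"inputs": [...]}` and the whole list goes through the model in one call rather than one invocation per item.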