When to use a SageMaker multi-model endpoint

I’m wondering when it becomes more efficient to use multi-model endpoints with SageMaker.

Right now I’m working on a project that uses PyTorch/Hugging Face transformer neural nets to classify words in natural language. This is the first model. The second model then takes the output of the first model and runs it through a second transformer in order to calculate a similarity metric against a value in a database.

At first I was going to separate both of these models completely, put them on separate endpoints, and connect the logic together with some wrapper script, but now I’m thinking it may be better to host both models on the same endpoint by utilizing a multi-model endpoint.

Would a multi-model endpoint make more sense for this use case? If so, is there any good article or documentation pertaining to how this can be achieved? Thanks!

@bennicholl I think this really depends on your use case, limitations, budget, and load.

If you have a heavy load, you need to scale the models up and down independently, and the latency profiles of the two differ, then it might make more sense to keep them separate.
If the models are always used sequentially, then you could also put both models into the same endpoint with an inference.py rather than creating a multi-model endpoint; that way you can leverage a GPU (GPUs are currently not supported by MME).
If you have quite infrequent load and no strict latency requirements, then you could go with SageMaker Serverless Inference instead of an MME.
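The single-endpoint idea above can be sketched with the hook functions the SageMaker PyTorch serving container looks for in inference.py (model_fn, input_fn, predict_fn, output_fn). The model subdirectories and pre/post-processing here are placeholders, assuming both models are packaged into one model artifact:

```python
import json
import os

# The serving container calls model_fn once at startup, then
# input_fn -> predict_fn -> output_fn for each request.


def model_fn(model_dir):
    # Load both transformers from the same artifact.
    # (Hypothetical subdirectory layout; adjust to your packaging.)
    from transformers import pipeline
    classifier = pipeline(
        "token-classification", model=os.path.join(model_dir, "classifier"))
    scorer = pipeline(
        "feature-extraction", model=os.path.join(model_dir, "scorer"))
    return {"classifier": classifier, "scorer": scorer}


def input_fn(request_body, content_type="application/json"):
    return json.loads(request_body)["inputs"]


def predict_fn(text, models):
    # Chain the two models: classify the words, then run the first
    # model's output through the similarity scorer.
    labels = models["classifier"](text)
    return models["scorer"](str(labels))


def output_fn(prediction, accept="application/json"):
    return json.dumps({"outputs": str(prediction)})
```

Since both models live in one container, the intermediate result never leaves the process, and a single GPU can serve both stages.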

“If so, is there any good article or documentation pertaining to how this can be achieved?”

Could you please explain what you mean by that?