Is there a tutorial on how to scale the serving of large language models on Kubernetes? I have a T5-large model deployed on my AKS cluster (svc → ing → nginx → pod + HPA, with gunicorn/flask serving inside the pod), but the HPA is not scaling out fast enough to keep up with the volume of incoming requests.
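For context, here is roughly what I'm imagining as a fix: scaling on request rate instead of CPU, plus a more aggressive scale-up policy. This is a minimal sketch only, assuming a Deployment named `t5-serving` (hypothetical) and that some metrics adapter (e.g. Prometheus Adapter) already exposes a per-pod requests-per-second metric called `http_requests_per_second` — both names are placeholders for whatever actually exists in the cluster. Is this the right direction, or is there a better pattern for LLM serving?

```yaml
# Sketch of an autoscaling/v2 HPA that scales on request rate rather than CPU.
# Assumes: a Deployment "t5-serving" (hypothetical name) and a metrics adapter
# that publishes a Pods metric "http_requests_per_second" (also an assumption).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: t5-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: t5-serving
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # Scale on request rate, since CPU utilization tends to lag behind
    # traffic spikes for a model-serving workload.
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "10"   # target ~10 req/s per pod; tune for T5-large latency
  behavior:
    scaleUp:
      # React quickly to bursts: no stabilization window,
      # allow adding up to 4 pods every 30 seconds.
      stabilizationWindowSeconds: 0
      policies:
        - type: Pods
          value: 4
          periodSeconds: 30
    scaleDown:
      # Scale down conservatively to avoid thrashing while traffic fluctuates.
      stabilizationWindowSeconds: 300
```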