I am trying to deploy meta-llama/Llama-2-70b-chat-hf (from Hugging Face) on SageMaker.
The sample code recommends using an "ml.g5.2xlarge" instance.
That seems like a very small instance for a model with this many parameters, but it is what the code indicates.
However, the deployment fails after about 15 minutes with:
ai.djl.engine.EngineException: GPU devices are not enough to run 2 partitions.
What is the smallest instance I can run this model on, then?
If the problem is the number of GPUs, the only alternative seems to be a multi-GPU g5.12xlarge, which costs about 5x as much, a prohibitive cost for me. Does this problem apply to all LLMs, or is there a smaller LLM that can run on a g5.xlarge/g5.2xlarge?
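For reference, here is the back-of-envelope arithmetic that made me doubt the single-GPU instance in the first place. These are my own numbers, not from the model card: per the AWS specs, a g5.2xlarge has one 24 GiB A10G and a g5.12xlarge has four, and I assume fp16/bf16 weights at 2 bytes per parameter (no quantization):

```python
# Rough GPU memory check for Llama-2-70B (my own estimate, not official sizing).
# Assumption: weights loaded in fp16/bf16, i.e. 2 bytes per parameter,
# ignoring KV cache and activation overhead (which only make it worse).

def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just for the model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

llama2_70b = weights_gib(70e9)   # ~130 GiB for fp16 weights alone
g5_2xlarge_vram = 1 * 24         # one NVIDIA A10G, 24 GiB
g5_12xlarge_vram = 4 * 24        # four A10Gs, 96 GiB total

print(f"70B fp16 weights: ~{llama2_70b:.0f} GiB")
print(f"g5.2xlarge VRAM:  {g5_2xlarge_vram} GiB")
print(f"g5.12xlarge VRAM: {g5_12xlarge_vram} GiB")
```

By this arithmetic even the g5.12xlarge would not hold the unquantized weights, which is why I am confused about what the sample code expects.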