I have deployed a couple of LLMs, Zephyr and Llama 2 13B among others, and changed the max-token and other related parameters, but each time after inference I only get a few tokens back and the response is cut off. Not sure what I am doing wrong. Any help is much appreciated.
My prompt is large and contains a set of instructions (the prompt format is adjusted to each model's chat template).
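For Llama 2 chat, for example, I wrap the prompt roughly like this (the system and user texts below are shortened placeholders, not my real prompt):

# Rough sketch of how the prompt is decorated for Llama 2 chat
# (placeholder texts, only to show the template shape)
system = "You are a helpful assistant. Follow the instructions exactly."
user = "...large instruction block plus input data..."
prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

The deployment code is below.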
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'meta-llama/Llama-2-13b-chat-hf',
    'SM_NUM_GPUS': json.dumps(4),
    'MAX_INPUT_LENGTH': json.dumps(4095),
    'MAX_TOTAL_TOKENS': json.dumps(8196),
    'max_new_tokens': json.dumps(8000),
    'HUGGING_FACE_HUB_TOKEN': 'hf_xyz'
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=1000,
)