Truncated, unfinished responses after deploying Hugging Face models on SageMaker

I have deployed a couple of LLMs (Zephyr and Llama 2 13B, among others) and changed max_new_tokens and the other token-limit parameters, but each time I run inference I get only a couple of tokens back and an incomplete response. I'm not sure what I'm doing wrong; any help is much appreciated.

My prompt is large and contains several instructions, and I format it according to each model's prompt template (see the sketch below for the Llama 2 case).
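For reference, this is roughly how I build the prompt for the Llama 2 chat model; the [INST]/<<SYS>> wrapping follows the Llama 2 chat template, and the system/user strings here are just placeholders:

def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    # Llama 2 chat template: system prompt wrapped in <<SYS>> tags,
    # the whole turn wrapped in [INST] ... [/INST]
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )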

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'meta-llama/Llama-2-13b-chat-hf',
    'SM_NUM_GPUS': json.dumps(4),           # shard the model across 4 GPUs
    'MAX_INPUT_LENGTH': json.dumps(4095),   # max prompt length in tokens
    'MAX_TOTAL_TOKENS': json.dumps(8196),   # max prompt + generated tokens
    'max_new_tokens': json.dumps(8000),
    'HUGGING_FACE_HUB_TOKEN': 'hf_xyz'
}
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=1000,
)
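And this is roughly how I invoke the endpoint. The payload follows the standard TGI {"inputs": ..., "parameters": ...} request shape; the prompt text and the max_new_tokens value are placeholders for what I have been trying:

# build a model-specific prompt (see the sketch above) and invoke the endpoint
prompt = build_llama2_prompt(
    "You are a helpful assistant.",
    "Summarize the attached report in detail.",
)
response = predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 8000,  # the value I have been experimenting with
    },
})
# the container returns a list with one generated_text entry
print(response[0]["generated_text"])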