Deploying TinyLlama Model via SageMaker Inference Endpoint with Custom Setup

Hello,

I’m running into problems while deploying the TinyLlama model through a SageMaker inference endpoint. I followed the prescribed steps, starting with downloading the model files from the Hugging Face page for TinyLlama-1.1B-Chat-v1.0, which gave me the set of files listed below. After setting up my environment, I tried several ECR images for the endpoint, most recently 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.2-gpu-py310-cu121-ubuntu22.04.

To proceed, I organized my model files and included a code folder for inference.py and requirements.txt, then packed everything into a model.tar.gz and uploaded it to S3. In my inference.py, I’ve scripted model_fn() and transform_fn() to initialize the tokenizer and model. Additionally, I’ve used Terraform to create the SageMaker endpoint, specifying the environment and model data URL.

The complete file structure I packaged into model.tar.gz is shown below.

|- .gitattributes
|- README.md
|- config.json
|- eval_results.json
|- generation_config.json
|- model.safetensors
|- special_tokens_map.json
|- tokenizer.json
|- tokenizer.model
|- tokenizer_config.json
|- code
    |----- inference.py
    |----- requirements.txt
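
For completeness, this is roughly how I build the archive before uploading it to S3 (the local directory name is just a placeholder); I keep the paths relative so that config.json, code/, etc. sit at the root of the tarball:

import tarfile
from pathlib import Path

# Local folder holding the files listed above (name is a placeholder)
model_dir = Path("tinyllama-1.1b-chat-v1.0")

with tarfile.open("model.tar.gz", "w:gz") as tar:
    for path in model_dir.iterdir():
        # arcname relative to model_dir so config.json, code/, etc. land at the archive root
        tar.add(path, arcname=path.name)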

I am using the lines below to create the tokenizer and model:

from transformers import AutoTokenizer, AutoModelForCausalLM

self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
self.model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    local_files_only=True,
)
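
For context, the overall shape of my inference.py is roughly the following (simplified from my class-based code; the prompt handling and generation parameters are illustrative, not exactly what I run):

import json

from transformers import AutoModelForCausalLM, AutoTokenizer


def model_fn(model_dir):
    # Load everything from /opt/ml/model, never from the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
    return {"model": model, "tokenizer": tokenizer}


def transform_fn(model_artifacts, request_body, content_type, accept):
    model = model_artifacts["model"]
    tokenizer = model_artifacts["tokenizer"]

    # The client double-serializes the message list (see the client code further down)
    data = json.loads(request_body)
    messages = json.loads(data["inputs"])

    # Build the prompt from the chat template shipped in tokenizer_config.json
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return json.dumps({"generated_text": text})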

Despite these efforts, invoking the endpoint with a client script seems to bypass my custom inference.py logic entirely and returns default, unrelated responses. The same loading code works correctly when I run the model locally, and none of the debug log statements I added to inference.py appear in CloudWatch.
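
For reference, this is how I look for my debug statements in CloudWatch (I'm assuming the standard /aws/sagemaker/Endpoints/<endpoint-name> log group; the endpoint name is a placeholder):

import boto3

logs = boto3.client("logs", region_name="us-east-1")
log_group = "/aws/sagemaker/Endpoints/endpoint_name"

# Read the most recently active log stream for the endpoint
streams = logs.describe_log_streams(
    logGroupName=log_group, orderBy="LastEventTime", descending=True, limit=1
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName=log_group, logStreamName=stream["logStreamName"], limit=50
    )
    for event in events["events"]:
        print(event["message"])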

I create the SageMaker endpoint with Terraform using the code snippet below.

primary_container = {
    image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.2-gpu-py310-cu121-ubuntu22.04"
    mode = "singleModel"
    model_data_url = "s3://<bucket>/models/model.tar.gz"
    environment = {
        "SAGEMAKER_REGION" = "us-east-1"
        "SAGEMAKER_PROGRAM" = "inference.py"
        "SAGEMAKER_SUBMIT_DIRECTORY" = "/opt/ml/model/code"
        "HF_MODEL_ID" = "/opt/ml/model"
        "HF_MODEL_QUANTIZE" = "bitsandbytes"
    }
}

I am using the client code below to invoke the endpoint:

import json

import boto3

client = boto3.client("sagemaker-runtime", region_name="us-east-1")

input_data_list = [
    {"role": "system", "content": "You are a friendly chatbot who always responds"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}
]

serialized_input_data = json.dumps(input_data_list)
payload = json.dumps({"inputs": serialized_input_data})

output = client.invoke_endpoint(
    EndpointName="endpoint_name",
    Body=payload,
    ContentType="application/json",
    Accept="application/json",
)

When I execute the client program, none of the functionality in inference.py appears to be used. The endpoint returns what look like default responses that are completely unrelated to the question I ask, and the response body has the form [{"generated_text": "<random_response>"}].
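
My suspicion is that the TGI container is serving requests directly and ignoring inference.py, in which case it would treat my serialized message list as a raw prompt string. For comparison, this is roughly what I understand a TGI-style request to look like, with the chat template applied on the client side (parameter values are just examples):

import json

import boto3
from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# Apply the chat template locally so "inputs" is a plain prompt string
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

payload = json.dumps({
    "inputs": prompt,
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},  # example values
})

client = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    Body=payload,
    ContentType="application/json",
    Accept="application/json",
)
print(json.loads(response["Body"].read()))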

Could someone guide me on ensuring my inference.py executes as intended? Is there an issue with the ECR image I used, or with how I’ve structured my Terraform configuration? I noticed a Hugging Face discussion suggesting entry_point and source_dir specifications, which I’m unsure how to integrate with Terraform for endpoint creation; I checked the sagemaker-huggingface module and don’t see entry_point or source_dir options there.
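
For reference, the discussion I found uses the SageMaker Python SDK rather than Terraform, roughly like this (role, instance type, and framework versions are placeholders I have not verified):

from sagemaker.huggingface import HuggingFaceModel

# SDK-based deployment as suggested in the Hugging Face discussion (placeholder values)
huggingface_model = HuggingFaceModel(
    model_data="s3://<bucket>/models/model.tar.gz",
    role="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
    entry_point="inference.py",
    source_dir="code",
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)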

Moreover, how can I verify whether my model.tar.gz is actually being loaded, or whether SageMaker is defaulting to a different model, which would explain the unrelated responses?
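
The only check I have so far is reading back the endpoint configuration with boto3 (the endpoint config name is a placeholder), which at least shows which image, model data URL, and environment are attached:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Inspect the image, model data URL, and environment the endpoint really uses
config = sm.describe_endpoint_config(EndpointConfigName="<endpoint-config-name>")
for variant in config["ProductionVariants"]:
    model = sm.describe_model(ModelName=variant["ModelName"])
    container = model["PrimaryContainer"]
    print(container["Image"])
    print(container.get("ModelDataUrl"))
    print(container.get("Environment"))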

Appreciate any advice or insights on this matter.