Deploying TinyLlama Model via SageMaker Inference Endpoint with Custom Setup

Hello,

I’m running into problems while deploying the TinyLlama model through a SageMaker inference endpoint. I followed the prescribed steps, starting with downloading the model files from the Hugging Face page for TinyLlama-1.1B-Chat-v1.0, which gave me the set of files listed below. After setting up my environment, I tried several ECR images for the endpoint, most recently 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.2-gpu-py310-cu121-ubuntu22.04.

To proceed, I organized my model files and included a code folder for inference.py and requirements.txt, then packed everything into a model.tar.gz and uploaded it to S3. In my inference.py, I’ve scripted model_fn() and transform_fn() to initialize the tokenizer and model. Additionally, I’ve used Terraform to create the SageMaker endpoint, specifying the environment and model data URL.

The complete file structure I packaged into model.tar.gz is shown below.

|- .gitattributes
|- README.md
|- config.json
|- eval_results.json
|- generation_config.json
|- model.safetensors
|- special_tokens_map.json
|- tokenizer.json
|- tokenizer.model
|- tokenizer_config.json
|- code
    |----- inference.py
    |----- requirements.txt
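
For completeness, this is roughly how I build the archive before uploading it to S3 (the local directory name is just a placeholder); I keep the paths relative so that config.json, code/, etc. sit at the root of the tarball:

import tarfile
from pathlib import Path

# Local folder holding the files listed above (name is a placeholder)
model_dir = Path("tinyllama-1.1b-chat-v1.0")

with tarfile.open("model.tar.gz", "w:gz") as tar:
    for path in model_dir.iterdir():
        # arcname relative to model_dir so config.json, code/, etc. land at the archive root
        tar.add(path, arcname=path.name)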

I am using the lines below to create the tokenizer and model:

from transformers import AutoTokenizer, AutoModelForCausalLM

self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
self.model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    local_files_only=True,
)
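
For context, the overall shape of my inference.py is roughly the following (simplified from my class-based code; the prompt handling and generation parameters are illustrative, not exactly what I run):

import json

from transformers import AutoModelForCausalLM, AutoTokenizer


def model_fn(model_dir):
    # Load everything from /opt/ml/model, never from the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
    return {"model": model, "tokenizer": tokenizer}


def transform_fn(model_artifacts, request_body, content_type, accept):
    model = model_artifacts["model"]
    tokenizer = model_artifacts["tokenizer"]

    # The client double-serializes the message list (see the client code further down)
    data = json.loads(request_body)
    messages = json.loads(data["inputs"])

    # Build the prompt from the chat template shipped in tokenizer_config.json
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return json.dumps({"generated_text": text})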

Despite these efforts, invoking the endpoint with a client script seems to bypass my custom inference.py logic entirely and returns default, unrelated responses. The same loading code works correctly when I run the model locally, and none of the debug log statements I added to inference.py appear in CloudWatch.
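
For reference, this is how I look for my debug statements in CloudWatch (I'm assuming the standard /aws/sagemaker/Endpoints/<endpoint-name> log group; the endpoint name is a placeholder):

import boto3

logs = boto3.client("logs", region_name="us-east-1")
log_group = "/aws/sagemaker/Endpoints/endpoint_name"

# Read the most recently active log stream for the endpoint
streams = logs.describe_log_streams(
    logGroupName=log_group, orderBy="LastEventTime", descending=True, limit=1
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName=log_group, logStreamName=stream["logStreamName"], limit=50
    )
    for event in events["events"]:
        print(event["message"])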

I create the SageMaker endpoint with Terraform using the code snippet below.

primary_container = {
    image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.2-gpu-py310-cu121-ubuntu22.04"
    mode = "singleModel"
    model_data_url = "s3://<bucket>/models/model.tar.gz"
    environment = {
        "SAGEMAKER_REGION" = "us-east-1"
        "SAGEMAKER_PROGRAM" = "inference.py"
        "SAGEMAKER_SUBMIT_DIRECTORY" = "/opt/ml/model/code"
        "HF_MODEL_ID" = "/opt/ml/model"
        "HF_MODEL_QUANTIZE" = "bitsandbytes"
    }
}

I am using the client code below to invoke the endpoint:

import json

import boto3

client = boto3.client("sagemaker-runtime", region_name="us-east-1")

input_data_list = [
    {"role": "system", "content": "You are a friendly chatbot who always responds"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}
]

serialized_input_data = json.dumps(input_data_list)
payload = json.dumps({"inputs": serialized_input_data})

output = client.invoke_endpoint(
    EndpointName="endpoint_name",
    Body=payload,
    ContentType="application/json",
    Accept="application/json",
)

When I execute the client program, none of the functionality in inference.py appears to be used. The endpoint returns what look like default responses that are completely unrelated to the question I ask, and the response body has the form [{"generated_text": "<random_response>"}].
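
My suspicion is that the TGI container is serving requests directly and ignoring inference.py, in which case it would treat my serialized message list as a raw prompt string. For comparison, this is roughly what I understand a TGI-style request to look like, with the chat template applied on the client side (parameter values are just examples):

import json

import boto3
from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# Apply the chat template locally so "inputs" is a plain prompt string
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

payload = json.dumps({
    "inputs": prompt,
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},  # example values
})

client = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    Body=payload,
    ContentType="application/json",
    Accept="application/json",
)
print(json.loads(response["Body"].read()))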

Could someone guide me on ensuring my inference.py executes as intended? Is there an issue with the ECR image I used, or with how I’ve structured my Terraform configuration? I noticed a Hugging Face discussion suggesting entry_point and source_dir specifications, which I’m unsure how to integrate with Terraform for endpoint creation; I checked the sagemaker-huggingface module and don’t see entry_point or source_dir options there.
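
For reference, the discussion I found uses the SageMaker Python SDK rather than Terraform, roughly like this (role, instance type, and framework versions are placeholders I have not verified):

from sagemaker.huggingface import HuggingFaceModel

# SDK-based deployment as suggested in the Hugging Face discussion (placeholder values)
huggingface_model = HuggingFaceModel(
    model_data="s3://<bucket>/models/model.tar.gz",
    role="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
    entry_point="inference.py",
    source_dir="code",
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)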

Moreover, how can I verify whether my model.tar.gz is actually being loaded, or whether SageMaker is defaulting to a different model, which would explain the unrelated responses?
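
The only check I have so far is reading back the endpoint configuration with boto3 (the endpoint config name is a placeholder), which at least shows which image, model data URL, and environment are attached:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Inspect the image, model data URL, and environment the endpoint really uses
config = sm.describe_endpoint_config(EndpointConfigName="<endpoint-config-name>")
for variant in config["ProductionVariants"]:
    model = sm.describe_model(ModelName=variant["ModelName"])
    container = model["PrimaryContainer"]
    print(container["Image"])
    print(container.get("ModelDataUrl"))
    print(container.get("Environment"))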

Appreciate any advice or insights on this matter.