SageMaker deployment fails for local Llama 2 model

Hi,
I want to deploy my own model on SageMaker.
Usually I do this by providing just a model ID, but in this case I want to use a local model (Llama 2 converted to the Hugging Face format).

    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    # TGI container image for the LLM deployment
    llm_image = get_huggingface_llm_image_uri(
        "huggingface",
        version="0.8.2"
    )
    # create HuggingFaceModel pointing at the model archive on S3
    llm_model = HuggingFaceModel(
        model_data=s3_location,
        transformers_version="4.31.0",
        role=role,
        model_server_workers=4,
        image_uri=llm_image,
        # env=config
    )
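
For completeness, the deployment call itself is just the usual deploy() on the model (the instance type below is only an example, pick one with enough GPU memory for Llama 2):

    # deploy the model to a real-time endpoint (example instance type)
    llm = llm_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",
    )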

The archive is structured as described in the docs:

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt
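
For reference, I build the archive roughly like this before uploading it to S3 (the local directory name "llama2-hf" is just an example):

    import tarfile

    # package the local model directory (hypothetical path "llama2-hf/") so that
    # pytorch_model.bin etc. sit at the archive root and code/ contains inference.py
    with tarfile.open("model.tar.gz", "w:gz") as tar:
        tar.add("llama2-hf", arcname=".")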

inference.py


import torch
from transformers import pipeline


def model_fn(model_dir):
    # load the model that SageMaker extracted into model_dir
    device = 0 if torch.cuda.is_available() else "cpu"
    pipe = pipeline(model=model_dir, device=device)
    return pipe


def predict_fn(data, model):
    # data is the deserialized JSON request payload
    print("inside predict_fn")
    print("data")
    print(data)

    return model(data["text"])
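
With that, predict_fn receives the deserialized JSON payload, so invoking the endpoint looks roughly like this (assuming llm is the predictor returned by deploy() above):

    # "text" matches the key that predict_fn reads from the incoming payload
    response = llm.predict({"text": "What is Amazon SageMaker?"})
    print(response)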

requirements.txt

git+https://github.com/huggingface/transformers.git
torch==1.13.1
boto3

However, I get an error about a missing model ID:

HF_MODEL_ID must be set

I don't quite understand it, since I provide 'model_data'.
@philschmid Do you have any idea why the ID is required here?
Your help is highly appreciated. Kindest regards,
Philip

This solves the issue: set HF_MODEL_ID to the path where SageMaker extracts model.tar.gz inside the container.

'HF_MODEL_ID': '/opt/ml/model',

It should be like this, where we provide the TGI config:

import json

number_of_gpu = 1  # adjust to the number of GPUs on the instance

# TGI config
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path where SageMaker extracts model.tar.gz inside the container (instead of a model ID from hf.co/models)
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}
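
A minimal sketch of how that config plugs into the model and the deployment (the instance type is just an example; s3_location and role are the same as in the original post):

    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

    llm_model = HuggingFaceModel(
        model_data=s3_location,  # model.tar.gz on S3
        role=role,
        image_uri=llm_image,
        env=config,  # TGI config from above, including HF_MODEL_ID=/opt/ml/model
    )

    llm = llm_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",  # example instance type
    )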

I have fine-tuned a text-generation model with PEFT and QLoRA. If I want to deploy it to SageMaker from S3, should inference.py load the adapter too?
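
I'm not sure what the canonical approach is, but if the adapter weights are shipped inside model.tar.gz, a model_fn sketch that loads the base model and then applies the adapter might look like this (the adapter/ subfolder name is an assumption, adjust to your archive layout):

    import os

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


    def model_fn(model_dir):
        # assumption: base model weights at the archive root, adapter weights under adapter/
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
        base_model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        model = PeftModel.from_pretrained(base_model, os.path.join(model_dir, "adapter"))
        # optionally merge the adapter into the base weights for faster inference
        model = model.merge_and_unload()
        return pipeline("text-generation", model=model, tokenizer=tokenizer)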