Deploying Fine-Tuned Falcon 40B with QLoRA on SageMaker Inference Error

Hi,

Note: I can only post 2 links because my account is new. See the next posts for all the missing links.
We have successfully deployed fine-tuned Falcon 7B and Falcon 7B Instruct models to an Amazon SageMaker inference endpoint.

The scripts/requirements.txt are the same as in @philschmid’s update.

In the scripts/run_clm.py file we added (see the sketch after this list):

  • modules_to_save=["lm_head"] to the LoraConfig below line 79
  • trust_remote_code=True to the AutoPeftModelForCausalLM.from_pretrained parameters on line 163.
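For reference, here is a minimal sketch of where the two additions land. The surrounding LoRA hyperparameters and the output directory are assumptions, not the exact values from the script; only modules_to_save and trust_remote_code are the actual changes:

```python
# Sketch of the two scripts/run_clm.py changes (surrounding values are assumptions)
import torch
from peft import LoraConfig, AutoPeftModelForCausalLM

peft_config = LoraConfig(
    r=64,                              # assumed LoRA rank
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
    modules_to_save=["lm_head"],       # added: also train and save the LM head
)

# ... after training, when the adapter is loaded and merged for deployment ...
training_output_dir = "/tmp/run"       # hypothetical training output directory
model = AutoPeftModelForCausalLM.from_pretrained(
    training_output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,            # added: Falcon ships custom modeling code
)
```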

In the 28_train_llms_with_qlora/sagemaker-notebook.ipynb we added (see the sketch after this list):

  • "model_revision": "2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5" to the hyperparameters dict for Falcon 7B Instruct in line 312. Use a model revision for your Falcon model variant that is newer than the "Revert in-library PR" commit.

We trained the model with the above changes.

For the deployment we built the v0.9.3 release of the Hugging Face Text Generation Inference (TGI) container image ourselves. Build your own container image and push it to an Amazon Elastic Container Registry (ECR) repository:

  1. git clone -b v0.9.3 https://github.com/huggingface/text-generation-inference.git
  2. cd text-generation-inference
  3. docker build -t <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.3 --target sagemaker .
  4. aws ecr get-login-password --region <YOUR_AWS_REGION> | docker login --username AWS --password-stdin <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com (skip if your Docker client is already authenticated against the repository)
  5. docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.3

Our Notebook to deploy the model looks like this:

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy Large Language Models (LLMs) to Amazon SageMaker using Hugging Face Text Generation Inference Container\n",
    "\n",
    "This is an example on how to deploy the open-source LLMs to Amazon SageMaker for inference using your own build of the Hugging Face TGI container.\n",
    "\n",
    "This examples demonstrate how to deploy a fine-tuned model from Amazon S3 to Amazon SageMaker.\n",
    "\n",
    "If you want to learn more about the Hugging Face TGI container check out the Hugging Face TGI GitHub repository. Lets get started!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup development environment\n",
    "\n",
    "We are going to use the `sagemaker` python SDK to deploy to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` python SDK installed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install \"sagemaker==2.163.0\" \"huggingface_hub\" \"hf-transfer\" --upgrade --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import boto3\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "\n",
    "# sagemaker session bucket ->| used for uploading data, models and logs\n",
    "# sagemaker will automatically create this bucket if it not exists\n",
    "sagemaker_session_bucket=None\n",
    "if sagemaker_session_bucket is None and sess is not None:\n",
    "    # set to default bucket if a bucket name is not given\n",
    "    sagemaker_session_bucket = sess.default_bucket()\n",
    "\n",
    "try:\n",
    "    role = sagemaker.get_execution_role()\n",
    "except ValueError:\n",
    "    iam = boto3.client('iam')\n",
    "    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
    "\n",
    "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
    "\n",
    "print(f\"sagemaker role arn: {role}\")\n",
    "print(f\"sagemaker session region: {sess.boto_region_name}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Point to your fine-tuned model on Amazon S3\n",
    "\n",
    "Set `s3_model_uri` to the Amazon S3 URI of the `model.tar.gz` created by your training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "s3_model_uri = \"<Amazon S3 URI that contains the model.tar.gz of your fine-tuned model>\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Retrieve your HuggingFace TGI container image\n",
    "\n",
    "Compared to deploying regular Hugging Face models we first need to retrieve the container uri and provide it to our `HuggingFaceModel` model class with a `image_uri` pointing to the image.\n",
    "At the time of writing the Hugging Face TGI container image for Amazon SageMake is on version 0.8.2. Version 0.8.2 did not work for us. So we've built our own TGI container image for sagemaker stage docker build from the latest Hugging Face TGI GitHub branch v0.9.3 and pushed the container image to a private Amazon Elastic Container Registry repo.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "llm image uri: 843197046435.dkr.ecr.eu-west-1.amazonaws.com/huggingface/text-generation-inference:0.9.2\n"
     ]
    }
   ],
   "source": [
    "llm_image = \"<YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.2\"\n",
    "# print ecr image uri\n",
    "print(f\"llm image uri: {llm_image}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Deploy finetuned-model to Amazon SageMaker\n",
    "\n",
    "To deploy your model to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc. We will use a `g5.12xlarge` instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from sagemaker.huggingface import HuggingFaceModel\n",
    "\n",
    "# sagemaker config\n",
    "instance_type = \"ml.g5.2xlarge\"\n",
    "number_of_gpu = 1\n",
    "health_check_timeout = 300\n",
    "\n",
    "# Define Model and Endpoint configuration parameter\n",
    "config = {\n",
    "  'HF_MODEL_ID': \"/opt/ml/model\", # path to where sagemaker stores the mode\n",
    "  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica\n",
    "  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text\n",
    "  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)\n",
    "  # 'HF_MODEL_QUANTIZE': \"bitsandbytes\",# Comment in to quantize\n",
    "}\n",
    "\n",
    "# create HuggingFaceModel with the image uri\n",
    "llm_model = HuggingFaceModel(\n",
    "  role=role,\n",
    "  image_uri=llm_image,\n",
    "  model_data=s3_model_uri,\n",
    "  env=config\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---------!"
     ]
    }
   ],
   "source": [
    "# Deploy model to an endpoint\n",
    "llm = llm_model.deploy(\n",
    "  initial_instance_count=1,\n",
    "  instance_type=instance_type,\n",
    "  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3\n",
    "  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SageMaker will now create our endpoint and deploy the model to it. This can takes a 10-15 minutes. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 5. Test the model and run inference\n",
    "\n",
    "After our endpoint is deployed we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can inference with different parameters to impact the generation. Parameters can be defined as in the `parameters` attribute of the payload.\n",
    "\n",
    "Replace the prompt with one that is relevant for you model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Hi Ajay,\n",
      "\n",
      "To filter a list of dictionaries in Python, you can use the `filter()` function along with a lambda expression to iterate over each dictionary in the list and return only the dictionaries that satisfy the given condition. Here's an example:\n",
      "\n",
      "```python\n",
      "my_list = [{'key': 'value', 'child': {'key': 'childvalue'}], [{'key': 'value', 'child': {'key': 'childvalue'}], [{'key': 'value', 'child': {'key': 'childvalue'}]]\n",
      "filtered_list = [dict(item) for item in my_list if item['child']['key'] == 'childvalue']\n",
      "print(filtered_list)\n",
      "```\n",
      "\n",
      "This will output:\n",
      "\n",
      "```python\n",
      "{'key': 'value', 'child': {'key': 'childvalue'}}\n",
      "{'key': 'value', 'child': {'key': 'childvalue'}}\n",
      "```\n",
      "\n",
      "Is there anything else I can help you with?\n",
      "\n",
      "Best regards,\n",
      "Olivia\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# define payload\n",
    "prompt=f\"<|system|>\\n You are an Python Expert<|end|>\\n<|user|>\\n{query}<|end|>\\n<|assistant|>\"\n",
    "\n",
    "# hyperparameters for llm\n",
    "payload = {\n",
    "  \"inputs\": prompt,\n",
    "  \"parameters\": {\n",
    "    \"do_sample\": True,\n",
    "    \"top_p\": 0.95,\n",
    "    \"temperature\": 0.2,\n",
    "    \"top_k\": 50,\n",
    "    \"max_new_tokens\": 256,\n",
    "    \"repetition_penalty\": 1.03,\n",
    "    \"stop\": [\"<|end|>\"]\n",
    "  }\n",
    "}\n",
    "\n",
    "# send request to endpoint\n",
    "response = llm.predict(payload)\n",
    "\n",
    "# print(response[0][\"generated_text\"][:-len(\"<human>:\")])\n",
    "print(response[0][\"generated_text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome! 🚀 We have successfully deployed our model from Amazon S3 to Amazon SageMaker and run inference on it. Now, its time for you to try it out yourself and build Generation AI applications with the new Hugging Face TGI container image on Amazon SageMaker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Clean up\n",
    "\n",
    "To clean up, we can delete the model and endpoint.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.delete_model()\n",
    "llm.delete_endpoint()"
   ]
  }
 ],
 "nbformat": 4,
 "nbformat_minor": 4
}

Hopefully this helps others.
