Deploying Fine-Tuned Falcon 40B with QLoRA on SageMaker Inference Error

Hi,

Note: I can only post 2 links because my account is new. See the next posts for all the missing links.
We have successfully deployed fine-tuned Falcon 7B and Falcon 7B Instruct models to an Amazon SageMaker inference endpoint.

The scripts/requirements.txt are the same as in @philschmid’s update.

In the scripts/run_clm.py file we added (see the sketch after this list):

  • modules_to_save=["lm_head"] to the LoraConfig below line 79
  • trust_remote_code=True to the AutoPeftModelForCausalLM.from_pretrained parameters on line 163.
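For reference, here is a minimal sketch of where the two additions land. The surrounding LoRA hyperparameters and the output directory are assumptions, not the exact values from the script; only modules_to_save and trust_remote_code are the actual changes:

```python
# Sketch of the two scripts/run_clm.py changes (surrounding values are assumptions)
import torch
from peft import LoraConfig, AutoPeftModelForCausalLM

peft_config = LoraConfig(
    r=64,                              # assumed LoRA rank
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
    modules_to_save=["lm_head"],       # added: also train and save the LM head
)

# ... after training, when the adapter is loaded and merged for deployment ...
training_output_dir = "/tmp/run"       # hypothetical training output directory
model = AutoPeftModelForCausalLM.from_pretrained(
    training_output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,            # added: Falcon ships custom modeling code
)
```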

In the 28_train_llms_with_qlora/sagemaker-notebook.ipynb we added (see the sketch after this list):

  • "model_revision": "2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5" to the hyperparameters dict for Falcon 7B Instruct in line 312. Use a model revision for your Falcon model variant that is newer than the "Revert in-library PR" commit.

We trained the model with the above changes.

For the deployment we built the v0.9.3 release of the Hugging Face Text Generation Inference (TGI) container image ourselves. Build your own container image and push it to an Amazon Elastic Container Registry (ECR) repository:

  1. git clone -b v0.9.3 https://github.com/huggingface/text-generation-inference.git
  2. cd text-generation-inference
  3. docker build -t <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.3 --target sagemaker .
  4. aws ecr get-login-password --region <YOUR_AWS_REGION> | docker login --username AWS --password-stdin <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com (skip if your Docker client is already authenticated against the repository)
  5. docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.3

Our Notebook to deploy the model looks like this:

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy Large Language Models (LLMs) to Amazon SageMaker using Hugging Face Text Generation Inference Container\n",
    "\n",
    "This is an example on how to deploy the open-source LLMs to Amazon SageMaker for inference using your own build of the Hugging Face TGI container.\n",
    "\n",
    "This examples demonstrate how to deploy a fine-tuned model from Amazon S3 to Amazon SageMaker.\n",
    "\n",
    "If you want to learn more about the Hugging Face TGI container check out the Hugging Face TGI GitHub repository. Lets get started!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup development environment\n",
    "\n",
    "We are going to use the `sagemaker` python SDK to deploy to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` python SDK installed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install \"sagemaker==2.163.0\" \"huggingface_hub\" \"hf-transfer\" --upgrade --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import boto3\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "\n",
    "# sagemaker session bucket ->| used for uploading data, models and logs\n",
    "# sagemaker will automatically create this bucket if it not exists\n",
    "sagemaker_session_bucket=None\n",
    "if sagemaker_session_bucket is None and sess is not None:\n",
    "    # set to default bucket if a bucket name is not given\n",
    "    sagemaker_session_bucket = sess.default_bucket()\n",
    "\n",
    "try:\n",
    "    role = sagemaker.get_execution_role()\n",
    "except ValueError:\n",
    "    iam = boto3.client('iam')\n",
    "    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
    "\n",
    "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
    "\n",
    "print(f\"sagemaker role arn: {role}\")\n",
    "print(f\"sagemaker session region: {sess.boto_region_name}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Point to your fine-tuned model on Amazon S3\n",
    "\n",
    "Set `s3_model_uri` to the Amazon S3 URI of the `model.tar.gz` created by your training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "s3_model_uri = \"<Amazon S3 URI that contains the model.tar.gz of your fine-tuned model>\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Retrieve your HuggingFace TGI container image\n",
    "\n",
    "Compared to deploying regular Hugging Face models we first need to retrieve the container uri and provide it to our `HuggingFaceModel` model class with a `image_uri` pointing to the image.\n",
    "At the time of writing the Hugging Face TGI container image for Amazon SageMake is on version 0.8.2. Version 0.8.2 did not work for us. So we've built our own TGI container image for sagemaker stage docker build from the latest Hugging Face TGI GitHub branch v0.9.3 and pushed the container image to a private Amazon Elastic Container Registry repo.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "llm image uri: 843197046435.dkr.ecr.eu-west-1.amazonaws.com/huggingface/text-generation-inference:0.9.2\n"
     ]
    }
   ],
   "source": [
    "llm_image = \"<YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.2\"\n",
    "# print ecr image uri\n",
    "print(f\"llm image uri: {llm_image}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Deploy finetuned-model to Amazon SageMaker\n",
    "\n",
    "To deploy your model to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc. We will use a `g5.12xlarge` instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from sagemaker.huggingface import HuggingFaceModel\n",
    "\n",
    "# sagemaker config\n",
    "instance_type = \"ml.g5.2xlarge\"\n",
    "number_of_gpu = 1\n",
    "health_check_timeout = 300\n",
    "\n",
    "# Define Model and Endpoint configuration parameter\n",
    "config = {\n",
    "  'HF_MODEL_ID': \"/opt/ml/model\", # path to where sagemaker stores the mode\n",
    "  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica\n",
    "  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text\n",
    "  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)\n",
    "  # 'HF_MODEL_QUANTIZE': \"bitsandbytes\",# Comment in to quantize\n",
    "}\n",
    "\n",
    "# create HuggingFaceModel with the image uri\n",
    "llm_model = HuggingFaceModel(\n",
    "  role=role,\n",
    "  image_uri=llm_image,\n",
    "  model_data=s3_model_uri,\n",
    "  env=config\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---------!"
     ]
    }
   ],
   "source": [
    "# Deploy model to an endpoint\n",
    "llm = llm_model.deploy(\n",
    "  initial_instance_count=1,\n",
    "  instance_type=instance_type,\n",
    "  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3\n",
    "  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SageMaker will now create our endpoint and deploy the model to it. This can takes a 10-15 minutes. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 5. Test the model and run inference\n",
    "\n",
    "After our endpoint is deployed we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can inference with different parameters to impact the generation. Parameters can be defined as in the `parameters` attribute of the payload.\n",
    "\n",
    "Replace the prompt with one that is relevant for you model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Hi Ajay,\n",
      "\n",
      "To filter a list of dictionaries in Python, you can use the `filter()` function along with a lambda expression to iterate over each dictionary in the list and return only the dictionaries that satisfy the given condition. Here's an example:\n",
      "\n",
      "```python\n",
      "my_list = [{'key': 'value', 'child': {'key': 'childvalue'}], [{'key': 'value', 'child': {'key': 'childvalue'}], [{'key': 'value', 'child': {'key': 'childvalue'}]]\n",
      "filtered_list = [dict(item) for item in my_list if item['child']['key'] == 'childvalue']\n",
      "print(filtered_list)\n",
      "```\n",
      "\n",
      "This will output:\n",
      "\n",
      "```python\n",
      "{'key': 'value', 'child': {'key': 'childvalue'}}\n",
      "{'key': 'value', 'child': {'key': 'childvalue'}}\n",
      "```\n",
      "\n",
      "Is there anything else I can help you with?\n",
      "\n",
      "Best regards,\n",
      "Olivia\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# define payload\n",
    "prompt=f\"<|system|>\\n You are an Python Expert<|end|>\\n<|user|>\\n{query}<|end|>\\n<|assistant|>\"\n",
    "\n",
    "# hyperparameters for llm\n",
    "payload = {\n",
    "  \"inputs\": prompt,\n",
    "  \"parameters\": {\n",
    "    \"do_sample\": True,\n",
    "    \"top_p\": 0.95,\n",
    "    \"temperature\": 0.2,\n",
    "    \"top_k\": 50,\n",
    "    \"max_new_tokens\": 256,\n",
    "    \"repetition_penalty\": 1.03,\n",
    "    \"stop\": [\"<|end|>\"]\n",
    "  }\n",
    "}\n",
    "\n",
    "# send request to endpoint\n",
    "response = llm.predict(payload)\n",
    "\n",
    "# print(response[0][\"generated_text\"][:-len(\"<human>:\")])\n",
    "print(response[0][\"generated_text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome! 🚀 We have successfully deployed our model from Amazon S3 to Amazon SageMaker and run inference on it. Now, its time for you to try it out yourself and build Generation AI applications with the new Hugging Face TGI container image on Amazon SageMaker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Clean up\n",
    "\n",
    "To clean up, we can delete the model and endpoint.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.delete_model()\n",
    "llm.delete_endpoint()"
   ]
  }
 ],
 "nbformat": 4,
 "nbformat_minor": 4
}

Hopefully this helps others.
