QLoRA trained LLaMA2 13B deployment error on Sagemaker using text generation inference image

I was following the article here to train the LLaMA 2 13B model.

But when I try to deploy with the TGI container, I run into this error:

2023-07-25T19:44:12.555090Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 246, in get_model
    return llama_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 58, in __init__
    filenames = weight_files(model_id, revision, ".bin")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/hub.py", line 86, in weight_files
    raise FileNotFoundError(
FileNotFoundError: No local weights found in /opt/ml/model with extension .bin
rank=0
2023-07-25T19:44:13.212448Z ERROR text_generation_launcher: Shard 0 failed to start:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 246, in get_model
    return llama_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 58, in __init__
    filenames = weight_files(model_id, revision, ".bin")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/hub.py", line 86, in weight_files
    raise FileNotFoundError(
FileNotFoundError: No local weights found in /opt/ml/model with extension .bin
Error: ShardCannotStart
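
From the traceback it looks like the 0.8.2 image only searches /opt/ml/model for *.bin shards, while my training output only contains safetensors/adapter files. One workaround I’m considering (just a sketch, not verified end-to-end; the base model id and paths are assumptions) is merging the QLoRA adapter and re-saving plain .bin shards before building model.tar.gz:

# Sketch: merge the QLoRA adapter into the base model and write pytorch_model-*.bin
# shards, since TGI 0.8.2 only picks up *.bin weights. Paths and ids are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/Llama-2-13b-hf"  # assumption: same base used for training
adapter_dir = "qlora-output"                 # placeholder: directory with the adapter weights
export_dir = "merged-model"

base = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

# safe_serialization=False writes .bin shards instead of .safetensors
merged.save_pretrained(export_dir, safe_serialization=False)
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(export_dir)

If the training job already merged the adapter and only saved safetensors, loading that checkpoint with AutoModelForCausalLM and re-saving it with safe_serialization=False should be enough.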

I use this code to deploy:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")


import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model",
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  model_data=f"s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-07-24-21-27-30-2023-07-24-21-30-34-443/output/model.tar.gz",  # Change to your model path
  role=role,
  image_uri=llm_image,
  env=config
)

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)
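
For reference, once the endpoint comes up I’m planning to invoke it roughly like this (just a sketch of the TGI request format):

# send a test request to the deployed endpoint
llm.predict({
  "inputs": "What is Amazon SageMaker?",
  "parameters": {"max_new_tokens": 128, "temperature": 0.7}
})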

The trained model is publicly available at the S3 URI above, in case that helps with debugging.

I haven’t been successful with the new LLM hosting version “0.8.2” in SageMaker. However, I managed to deploy the trained model using custom inference code (def model_fn(model_dir) and def predict_fn(data, model_and_tokenizer)).

I opened a similar issue here:

If you find any solutions, please let me know.

Thanks @Jorgeutd, do you mind sharing your inference.py file? Did you also override with a new requirements.txt? Now that my model artifact is written to S3, how do I get a new inference.py into the tarball (other than downloading it, unzipping it, and re-tarring it with the new inference.py, which would be pretty time consuming)?
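
In case I do end up going the repack route, this is roughly what I have in mind (a sketch only; bucket names and paths are placeholders, and running it on an instance in the same region keeps the S3 transfer reasonably fast):

# Sketch: repackage model.tar.gz with a custom code/ directory
import os
import tarfile
from sagemaker.s3 import S3Downloader, S3Uploader

model_s3_uri = "s3://<your-bucket>/<training-job>/output/model.tar.gz"  # placeholder
work_dir = "repack"

# 1. download and unpack the training output
os.makedirs(work_dir, exist_ok=True)
S3Downloader.download(model_s3_uri, work_dir)
with tarfile.open(os.path.join(work_dir, "model.tar.gz"), "r:gz") as tar:
    tar.extractall(os.path.join(work_dir, "model"))

# 2. drop the custom inference.py (and requirements.txt) into model/code
os.makedirs(os.path.join(work_dir, "model", "code"), exist_ok=True)
# ...copy inference.py / requirements.txt into work_dir/model/code here...

# 3. re-tar and upload to a new prefix
repacked = os.path.join(work_dir, "model_repacked.tar.gz")
with tarfile.open(repacked, "w:gz") as tar:
    tar.add(os.path.join(work_dir, "model"), arcname=".")
S3Uploader.upload(repacked, "s3://<your-bucket>/repacked/")  # placeholder prefix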

@philschmid since 0.8.2 is still expecting .bin files, is there any way to override this during deployment so it plays nicely with a safetensors-only model? And are there plans to push the 0.9.3 version of the TGI image any time soon?

Following the example below, I’m trying to build 0.9.3 in the hope that it will solve my issue with deploying a model that only has .safetensors files.

How did you manage to build the Docker image @malterei? I’ve tried building it on my M1 MacBook (which fails, since it looks like the Dockerfile has a hardcoded exit 1 when the arch is arm64), as well as on several instance types on EC2. All of them eventually get stuck at this stage because the build consumes all of the instance’s memory:

 => [vllm-builder 3/3] RUN make build-vllm                                             69.5s
 => => # python3.9/site-packages/torch/include/TH -I/opt/conda/lib/python3.9/site-packages/t
 => => # orch/include/THC -I/opt/conda/include -I/opt/conda/include/python3.9 -c -c /usr/src
 => => # /vllm/csrc/cache.cpp -o /usr/src/vllm/build/temp.linux-x86_64-cpython-39/csrc/cache
 => => # .o -g -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_API_INCLUDE_EXTENSION_H '-D
 => => # PYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_A
 => => # BI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=vllm_cache_ops -D_GLIBCXX_USE_CXX11_ABI=0
 => CACHED [planner 7/7] RUN cargo chef prepare --recipe-path recipe.json               0.0s
 => CACHED [builder  2/10] COPY --from=planner /usr/src/recipe.json recipe.json         0.0s
 => CACHED [builder  3/10] RUN cargo chef cook --release --recipe-path recipe.json      0.0s
 => CACHED [builder  4/10] COPY Cargo.toml Cargo.toml                                   0.0s
 => CACHED [builder  5/10] COPY rust-toolchain.toml rust-toolchain.toml                 0.0s
 => CACHED [builder  6/10] COPY proto proto                                             0.0s
 => CACHED [builder  7/10] COPY benchmark benchmark                                     0.0s
 => CACHED [builder  8/10] COPY router router                                           0.0s
 => CACHED [builder  9/10] COPY launcher launcher                                       0.0s
 => CACHED [builder 10/10] RUN cargo build --release                                    0.0s

I’ve tried building 0.9.3 and 0.9.2; both end up with some error when the build hits build-vllm. Am I missing something?

%%writefile model/code/inference.py
# let's deploy the Falcon 7B model for inference

import os
import json
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_int8_training, PeftConfig, PeftModel


def model_fn(model_dir):
    # load model and tokenizer from model_dir
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    # unpack model and tokenizer
    model, tokenizer = model_and_tokenizer

    # process input
    inputs = data.pop("inputs", data)
    parameters = data.pop("parameters", None)

    # preprocess
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(model.device)

    # pass inputs with all kwargs in data
    if parameters is not None:
        outputs = model.generate(input_ids, **parameters)
    else:
        outputs = model.generate(input_ids)

    # postprocess the prediction
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return [{"generated_text": prediction}]

Yes, I changed the requirements.txt file to use a newer transformers version.
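
Something along these lines should work (the packages and pins below are illustrative; match whatever versions your training run used):

transformers>=4.31
peft
accelerate
bitsandbytes
einops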

Can someone push up the 0.9.3 image built with --target sagemaker to Docker Hub? I’m unable to build it on my local machine.

Last night the new image got released: Release v1.0-hf-tgi-0.9.3-pt-2.0.1-inf-gpu-py39 · aws/deep-learning-containers · GitHub
It’s not yet available in the sagemaker-sdk, but you can use the URI directly.
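
For example, something like this (the exact ECR URI and tag come from the release notes above; the value below is a placeholder), reusing role and config from the snippet earlier in the thread:

from sagemaker.huggingface import HuggingFaceModel

# use the TGI 0.9.3 DLC URI from the release notes directly instead of
# get_huggingface_llm_image_uri (placeholder account/region/tag below)
llm_image = "<account>.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-tgi-inference:<tag-from-release-notes>"

llm_model = HuggingFaceModel(
  model_data="s3://<your-bucket>/<path>/model.tar.gz",  # placeholder
  role=role,
  image_uri=llm_image,
  env=config,
)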


Thank you @philschmid for all your work.

With this new release, will all the issues regarding deploying fine-tuned models with the LLM inference image be fixed?


Hi,
I am using the config @rycfung used, but I’m deploying a fine-tuned llama-13b (QLoRA-merged) on a 4xlarge instance.

I am getting a torch.cuda.OutOfMemoryError and I’m confused about why that happens. I used a 4xlarge for the fine-tuning, so it seems mysterious that inference would require a bigger GPU setup than the fine-tuning itself, which is more computationally expensive.

2023-08-05T10:17:44.741-05:00
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 21.13 GiB
Requested : 150.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 25.12 MiB
PyTorch limit (set by user-supplied memory fraction) : 17179869184.00 GiB

I’m pretty sure it’s because with QLoRA the base model sits on the GPU in 4-bit and only the small adapter matrices are being trained, whereas for inference the merged model is loaded in full half precision, so all of the weights need to fit on the GPU.
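
Back-of-the-envelope: 13B parameters in bf16/fp16 is roughly 13e9 x 2 bytes ≈ 26 GB just for the weights, before the KV cache, whereas the same model in 4-bit is roughly 13e9 x 0.5 bytes ≈ 6.5 GB plus the small adapter matrices. A g5.4xlarge has a single 24 GB A10G, so a merged half-precision 13B doesn’t fit on one GPU; you’d need to quantize at serving time (e.g. the HF_MODEL_QUANTIZE option from the config earlier in the thread) or move to an instance with more GPUs such as a g5.12xlarge.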

@elodium I ended up building the 0.9.3 image from scratch (which was half a day of work between the actual compiling and figuring out build configurations to stop the build from freezing and blowing through all of the memory, even with 100 GB of RAM on EC2).

I ended up deploying with the TGI 0.9.3 image I built on a g5.4xlarge, and it worked. The only issue was that, even though I deployed the 4-bit QLoRA LLaMA 2 13B, generation was pretty slow and often froze on the SageMaker deployment or timed out after 30s. That was super odd, because I was expecting the 4-bit quantized 13B to breeze through generation on a 4xlarge.

Going to try the official 0.9.3 image to see if there’s a difference.


@rycfung I use TGI 0.9 (not specifically 0.9.3) and I have successfully deployed my fine-tuned model on an ml.g5.12xlarge endpoint. Have you resolved the issue?

Yes, I was able to deploy with 0.9.3; it was pretty slow to generate, though. I wonder what the generation speed was like for you, @elodium.

I’m running into the same issue… but I hit the error even when using an ml.g5.12xlarge… Deploy Llama 2 7B/13B/70B on Amazon SageMaker

Is there a setting we’re missing that is preventing us from using the smaller instances, @philschmid?

(I am downloading the model from HF and loading it locally because my endpoint needs to have network isolation.)


Never mind, this was fixed by either re-cloning the model repo or including the 'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192) parameter in the model config.
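
i.e. roughly this env config (other values as in the snippet earlier in the thread):

import json

config = {
  'HF_MODEL_ID': "/opt/ml/model",
  'SM_NUM_GPUS': json.dumps(1),
  'MAX_INPUT_LENGTH': json.dumps(2048),
  'MAX_TOTAL_TOKENS': json.dumps(4096),
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # the extra setting that resolved it for me
}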
