Deploying fine-tuned Falcon 40B with QLoRA on SageMaker: inference error

Hi team,

I was able to successfully fine-tune the Falcon model following the instructions in this notebook:

Then I tried to deploy the trained model following what was recommended in the next-steps section, using the new Hugging Face LLM Inference Container:

Check out the Deploy Falcon 7B & 40B on Amazon SageMaker and Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker for more details.

This was my deployment code:

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'HF_MODEL_QUANTIZE': "bitsandbytes", # Comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data=s3_model_uri,
  env=config
)

I got the following error in the logs:

2023-07-14T14:28:57.416297Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, sharded: None, num_shard: Some(4), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: , watermark_gamma: None, watermark_delta: None, env: false }
2023-07-14T14:28:57.416332Z  INFO text_generation_launcher: Sharding model on 4 processes
2023-07-14T14:28:57.416401Z  INFO text_generation_launcher: Starting download process.
2023-07-14T14:29:00.968073Z  WARN download: text_generation_launcher: No safetensors weights found for model /opt/ml/model at revision None. Converting PyTorch weights to safetensors.
2023-07-14T14:29:00.968114Z  INFO download: text_generation_launcher: Convert /opt/ml/model/pytorch_model-00001-of-00009.bin to /opt/ml/model/model-00001-of-00009.safetensors.
2023-07-14T14:29:10.426801Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
    utils.convert_files(local_pt_files, local_st_files)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
    convert_file(pt_file, sf_file)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 62, in convert_file
    save_file(pt_state, str(sf_file), metadata={"format": "pt"})
  File "/opt/conda/lib/python3.9/site-packages/safetensors/torch.py", line 232, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 30, kind: ReadOnlyFilesystem, message: "Read-only file system" })
Error: DownloadError

(The same Args / convert / "Read-only file system" / DownloadError sequence then repeats on every launcher restart — at 14:29:12, 14:29:27, 14:29:42, and again at 14:34:57 — for the rest of the log.)

Any advice to fix this? Thank you.

2 Likes

Even I am getting the same error.

This might be a question for @philschmid

1 Like

I am facing the same issue as well. The container is not able to write to the model folder. I'm not sure whether the safetensors files are supposed to be part of the model artifacts. @philschmid, could you please suggest ways to resolve this? It would be very helpful. Thank you!

I think I understand what is happening here. Recently HF started to use safetensors for serializing model weights. If the model serving code encounters non-safetensors weights, it will attempt to use this code to convert them to safetensors: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/convert.py

The resulting safetensor weights are then written to the local file system.

On SageMaker the /opt/ml/model directory is read-only, so writing the new weights fails. I will report this to the relevant SageMaker team.
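
One way around this, as a rough sketch (the paths and file names here are just illustrative), is to run the same conversion yourself before packaging the model.tar.gz, so the container already finds .safetensors shards and never tries to write to /opt/ml/model:

# Hypothetical pre-conversion step, run locally or at the end of the training job
# before tarring/uploading the model: it mirrors what TGI's convert.py would do at startup.
import torch
from pathlib import Path
from safetensors.torch import save_file

model_dir = Path("falcon-40b-qlora-merged")  # local copy of the trained model (illustrative)

for pt_file in sorted(model_dir.glob("pytorch_model*.bin")):
    state_dict = torch.load(pt_file, map_location="cpu")
    state_dict = {k: v.contiguous() for k, v in state_dict.items()}  # safetensors wants contiguous tensors
    sf_name = pt_file.name.replace("pytorch_model", "model").replace(".bin", ".safetensors")
    save_file(state_dict, str(model_dir / sf_name), metadata={"format": "pt"})
    pt_file.unlink()  # remove the .bin so the launcher picks up only the safetensors shards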

1 Like

This is helpful, @ellisonbg. Any suggestions on how to bypass this safetensors conversion and deploy from S3 (.tar.gz) directly?

I am encountering the same issue when fine-tuning Falcon 7B.

Exception:

2023-07-18T03:17:05.812589Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
    utils.convert_files(local_pt_files, local_st_files)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
    convert_file(pt_file, sf_file)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 62, in convert_file
    save_file(pt_state, str(sf_file), metadata={"format": "pt"})
  File "/opt/conda/lib/python3.9/site-packages/safetensors/torch.py", line 232, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)

 safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 30, kind: ReadOnlyFilesystem, message: "Read-only file system" })

Any pointers or workarounds would be greatly appreciated.

Thanks.

Hello,

We updated the script and requirements to merge and save the weights in the safetensors format during training, meaning no conversion is needed on the LLM inference container side.

You can update the script and requirements, and it should work when deploying after training.
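
For reference, the merge-and-save step at the end of training looks roughly like this (a sketch only; the exact arguments in the updated run_clm.py may differ):

# Merge the QLoRA adapter into the base model and save safetensors weights.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

output_dir = "/opt/ml/model"  # SageMaker training output dir (illustrative)

model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # Falcon's RefinedWeb modelling code lives in the model repo
)
merged_model = model.merge_and_unload()  # fold the LoRA weights into the base model
merged_model.save_pretrained(output_dir, safe_serialization=True)  # writes model-*.safetensors

tokenizer = AutoTokenizer.from_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)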

2 Likes

Started training now with the updated script. Thanks a lot!

1 Like

Thanks for the improvement, Philipp.

While this allows the training to complete, deployment using the latest TGI container (0.9.2) does not seem to be able to load the model:

RuntimeError: weight lm_head.weight does not exist

Any suggestion on how to resolve this problem when deploying the finetuned model using the latest TGI container?

We haven't released 0.9.2 on SageMaker yet.

Hi @philschmid,

when trying to deploy the model fine-tuned with your latest update to an Amazon SageMaker endpoint using TGI 0.8.2, we get the following error:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 220, in get_model
    return FlashRW(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 59, in __init__
    filenames = weight_files(model_id, revision, ".bin")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/hub.py", line 86, in weight_files
    raise FileNotFoundError(
FileNotFoundError: No local weights found in /opt/ml/model with extension .bin
 rank=0

We built our own container image based on TGI 0.9.2; with that we get a few steps further, but then it fails with the lm_head.weight error that Arlind mentioned above. The stack trace for that error is:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 175, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
Error: ShardCannotStart
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 142, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 253, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 62, in __init__
    model = FlashRWForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 636, in __init__
    self.lm_head = TensorParallelHead.load(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 194, in load
    weight = weights.get_tensor(f"{prefix}.weight")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 62, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

Do you have any idea what the cause could be?

Hi @philschmid, thanks for the updated requirements and script.

I never had any problem with the training aspect; my issue has always been with the deployment.

I updated the requirements file and run_clm.py to use safe_serialization=True, but I am still getting errors in the deployment step.

Here is the error from the logs:

2023-07-19T12:42:57.135043Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, sharded: None, num_shard: Some(4), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: , watermark_gamma: None, watermark_delta: None, env: false }
2023-07-19T12:42:57.135074Z  INFO text_generation_launcher: Sharding model on 4 processes
2023-07-19T12:42:57.135162Z  INFO text_generation_launcher: Starting download process.
2023-07-19T12:43:01.254896Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-19T12:43:02.239966Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-19T12:43:02.240163Z  INFO text_generation_launcher: Starting shard 0
2023-07-19T12:43:02.240794Z  INFO text_generation_launcher: Starting shard 1
2023-07-19T12:43:02.240867Z  INFO text_generation_launcher: Starting shard 2
2023-07-19T12:43:02.240866Z  INFO text_generation_launcher: Starting shard 3
2023-07-19T12:43:12.249841Z  INFO text_generation_launcher: Waiting for shard 0 to be ready…
2023-07-19T12:43:12.250390Z  INFO text_generation_launcher: Waiting for shard 3 to be ready…
2023-07-19T12:43:12.250390Z  INFO text_generation_launcher: Waiting for shard 1 to be ready…
2023-07-19T12:43:12.251351Z  INFO text_generation_launcher: Waiting for shard 2 to be ready…
2023-07-19T12:43:18.737185Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 209, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 161, in __init__
    model=model.to(device),
  File "/usr/src/transformers/src/transformers/modeling_utils.py", line 1903, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
 rank=3

(The same "Cannot copy out of meta tensor; no data!" traceback is raised by the shard managers for ranks 2, 1 and 0 as well.)

2023-07-19T12:43:19.455252Z ERROR text_generation_launcher: Shard 3 failed to start:
You are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.

My deployment code is the same:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'HF_MODEL_QUANTIZE': "bitsandbytes", # Comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data=s3_model_uri,
  env=config
)

# Deploy model to an endpoint
# (Model — sagemaker 2.198.0 documentation)
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)

Hey @monuirctc did this work for you? Thank you.

@Jorgeutd which model are you fine-tuning? Falcon 40B?
Can you check whether the normal Falcon 40B works?
Any chance we can access your checkpoint? Or can you describe steps to reproduce?

1 Like

No, it's not resolved yet.

Thank you for your answer @philschmid.

Here is the model_id that I fine-tuned:

model_id = "tiiuae/falcon-40b" # sharded weights

Are you suggesting to try the instruct model?

I was able to retrieve the contents of the model.tar file; here are the contents, as well as the detailed information in the config.json file.

Here are the config.json contents:

{
  "_name_or_path": "tiiuae/falcon-40b",
  "alibi": false,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "RWForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "tiiuae/falcon-40b--configuration_RW.RWConfig",
    "AutoModel": "tiiuae/falcon-40b--modelling_RW.RWModel",
    "AutoModelForCausalLM": "tiiuae/falcon-40b--modelling_RW.RWForCausalLM",
    "AutoModelForQuestionAnswering": "tiiuae/falcon-40b--modelling_RW.RWForQuestionAnswering",
    "AutoModelForSequenceClassification": "tiiuae/falcon-40b--modelling_RW.RWForSequenceClassification",
    "AutoModelForTokenClassification": "tiiuae/falcon-40b--modelling_RW.RWForTokenClassification"
  },
  "bias": false,
  "bos_token_id": 11,
  "eos_token_id": 11,
  "hidden_dropout": 0.0,
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "RefinedWeb",
  "n_head": 128,
  "n_head_kv": 8,
  "n_layer": 60,
  "parallel_attn": true,
  "torch_dtype": "float16",
  "transformers_version": "4.30.2",
  "use_cache": true,
  "vocab_size": 65024
}

Please let me know if you have any suggestions @philschmid

As I said before, I did not have issues with the training step, just the deployment. Here is the deployment code again, pointing to the location of the model:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'HF_MODEL_QUANTIZE': "bitsandbytes", # Comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data=s3_model_uri,
  env=config
)

# Deploy model to an endpoint
# (Model — sagemaker 2.173.0 documentation)
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)

Thank you @monuirctc

Can you please also share your deployment code and the contents of your tar file, if possible? That way we can figure out why the new DLC is not working properly.

Thank you, we are investigating.

1 Like

Hi,

Note: I can only post 2 links because my account is new. See the next posts for all the missing links.
We have successfully deployed a fine-tuned Falcon 7B & Falcon 7B instruct model to an Amazon SageMaker inference endpoint.

The scripts/requirements.txt are the same as in @philschmid’s update.

In the scripts/run_clm.py file we added the following (a rough sketch of both changes follows the list below):

  • modules_to_save=["lm_head"] to the LoraConfig below line 79
  • trust_remote_code=True to the AutoPeftModelForCausalLM.from_pretrained parameters in line 163.
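
A rough sketch of those two changes (the surrounding arguments are assumptions, not the exact script):

# Change 1: have LoraConfig also save the lm_head so the merged checkpoint contains lm_head.weight.
import torch
from peft import LoraConfig, AutoPeftModelForCausalLM

peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query_key_value"],
    bias="none",
    task_type="CAUSAL_LM",
    modules_to_save=["lm_head"],
)

# Change 2: pass trust_remote_code=True when reloading the adapter for merging.
model = AutoPeftModelForCausalLM.from_pretrained(
    "falcon-qlora-output",       # illustrative output dir
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)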

In the 28_train_llms_with_qlora/sagemaker-notebook.ipynb we added the following (see the snippet after this list):

  • "model_revision": "2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5" to the hyperparameters dict for Falcon Instruct 7B in line 312. Use the model revision for your Falcon model type that is newer than the Revert in-library PR.

We trained the model with the above changes.

For the deployment we built the latest v0.9.3 release of the Hugging Face text generation inference container image. Build your own container image and push it to an Amazon Elastic Container Registry repository.

  1. git clone -b v0.9.3 https://github.com/huggingface/text-generation-inference.git
  2. cd text-generation-inference
  3. docker build -t <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.3 --target sagemaker .
  4. docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.3

Our Notebook to deploy the model looks like this:

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy Large Language Models (LLMs) to Amazon SageMaker using Hugging Face Text Generation Inference Container\n",
    "\n",
    "This is an example on how to deploy the open-source LLMs to Amazon SageMaker for inference using your own build of the Hugging Face TGI container.\n",
    "\n",
    "This examples demonstrate how to deploy a fine-tuned model from Amazon S3 to Amazon SageMaker.\n",
    "\n",
    "If you want to learn more about the Hugging Face TGI container check out the Hugging Face TGI GitHub repository. Lets get started!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup development environment\n",
    "\n",
    "We are going to use the `sagemaker` python SDK to deploy to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` python SDK installed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install \"sagemaker==2.163.0\" \"huggingface_hub\" \"hf-transfer\" --upgrade --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import boto3\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "\n",
    "# sagemaker session bucket ->| used for uploading data, models and logs\n",
    "# sagemaker will automatically create this bucket if it not exists\n",
    "sagemaker_session_bucket=None\n",
    "if sagemaker_session_bucket is None and sess is not None:\n",
    "    # set to default bucket if a bucket name is not given\n",
    "    sagemaker_session_bucket = sess.default_bucket()\n",
    "\n",
    "try:\n",
    "    role = sagemaker.get_execution_role()\n",
    "except ValueError:\n",
    "    iam = boto3.client('iam')\n",
    "    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
    "\n",
    "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
    "\n",
    "print(f\"sagemaker role arn: {role}\")\n",
    "print(f\"sagemaker session region: {sess.boto_region_name}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "s3_model_uri = \"<Amazon S3 URI that contains the model.tar.gz of your fine-tuned model>\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Retrieve your HuggingFace TGI container image\n",
    "\n",
    "Compared to deploying regular Hugging Face models we first need to retrieve the container uri and provide it to our `HuggingFaceModel` model class with a `image_uri` pointing to the image.\n",
    "At the time of writing the Hugging Face TGI container image for Amazon SageMake is on version 0.8.2. Version 0.8.2 did not work for us. So we've built our own TGI container image for sagemaker stage docker build from the latest Hugging Face TGI GitHub branch v0.9.3 and pushed the container image to a private Amazon Elastic Container Registry repo.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "llm image uri: 843197046435.dkr.ecr.eu-west-1.amazonaws.com/huggingface/text-generation-inference:0.9.2\n"
     ]
    }
   ],
   "source": [
    "llm_image = \"<YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com/<YOUR_ECR_REPO>:0.9.2\"\n",
    "# print ecr image uri\n",
    "print(f\"llm image uri: {llm_image}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Deploy finetuned-model to Amazon SageMaker\n",
    "\n",
    "To deploy your model to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc. We will use a `g5.12xlarge` instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from sagemaker.huggingface import HuggingFaceModel\n",
    "\n",
    "# sagemaker config\n",
    "instance_type = \"ml.g5.2xlarge\"\n",
    "number_of_gpu = 1\n",
    "health_check_timeout = 300\n",
    "\n",
    "# Define Model and Endpoint configuration parameter\n",
    "config = {\n",
    "  'HF_MODEL_ID': \"/opt/ml/model\", # path to where sagemaker stores the mode\n",
    "  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica\n",
    "  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text\n",
    "  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)\n",
    "  # 'HF_MODEL_QUANTIZE': \"bitsandbytes\",# Comment in to quantize\n",
    "}\n",
    "\n",
    "# create HuggingFaceModel with the image uri\n",
    "llm_model = HuggingFaceModel(\n",
    "  role=role,\n",
    "  image_uri=llm_image,\n",
    "  model_data=s3_model_uri,\n",
    "  env=config\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---------!"
     ]
    }
   ],
   "source": [
    "# Deploy model to an endpoint\n",
    "llm = llm_model.deploy(\n",
    "  initial_instance_count=1,\n",
    "  instance_type=instance_type,\n",
    "  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3\n",
    "  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SageMaker will now create our endpoint and deploy the model to it. This can takes a 10-15 minutes. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 5. Test the model and run inference\n",
    "\n",
    "After our endpoint is deployed we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can inference with different parameters to impact the generation. Parameters can be defined as in the `parameters` attribute of the payload.\n",
    "\n",
    "Replace the prompt with one that is relevant for you model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Hi Ajay,\n",
      "\n",
      "To filter a list of dictionaries in Python, you can use the `filter()` function along with a lambda expression to iterate over each dictionary in the list and return only the dictionaries that satisfy the given condition. Here's an example:\n",
      "\n",
      "```python\n",
      "my_list = [{'key': 'value', 'child': {'key': 'childvalue'}], [{'key': 'value', 'child': {'key': 'childvalue'}], [{'key': 'value', 'child': {'key': 'childvalue'}]]\n",
      "filtered_list = [dict(item) for item in my_list if item['child']['key'] == 'childvalue']\n",
      "print(filtered_list)\n",
      "```\n",
      "\n",
      "This will output:\n",
      "\n",
      "```python\n",
      "{'key': 'value', 'child': {'key': 'childvalue'}}\n",
      "{'key': 'value', 'child': {'key': 'childvalue'}}\n",
      "```\n",
      "\n",
      "Is there anything else I can help you with?\n",
      "\n",
      "Best regards,\n",
      "Olivia\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# define payload\n",
    "prompt=f\"<|system|>\\n You are an Python Expert<|end|>\\n<|user|>\\n{query}<|end|>\\n<|assistant|>\"\n",
    "\n",
    "# hyperparameters for llm\n",
    "payload = {\n",
    "  \"inputs\": prompt,\n",
    "  \"parameters\": {\n",
    "    \"do_sample\": True,\n",
    "    \"top_p\": 0.95,\n",
    "    \"temperature\": 0.2,\n",
    "    \"top_k\": 50,\n",
    "    \"max_new_tokens\": 256,\n",
    "    \"repetition_penalty\": 1.03,\n",
    "    \"stop\": [\"<|end|>\"]\n",
    "  }\n",
    "}\n",
    "\n",
    "# send request to endpoint\n",
    "response = llm.predict(payload)\n",
    "\n",
    "# print(response[0][\"generated_text\"][:-len(\"<human>:\")])\n",
    "print(response[0][\"generated_text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome! 🚀 We have successfully deployed our model from Amazon S3 to Amazon SageMaker and run inference on it. Now, its time for you to try it out yourself and build Generation AI applications with the new Hugging Face TGI container image on Amazon SageMaker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Clean up\n",
    "\n",
    "To clean up, we can delete the model and endpoint.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.delete_model()\n",
    "llm.delete_endpoint()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "nbformat": 4,
 "nbformat_minor": 4
}

Hopefully this helps others.

2 Likes