Deploying Fine-Tuned Falcon 40B with QLoRA on SageMaker: Inference Error

@malterei My issue was with the Falcon model:

model_id = "tiiuae/falcon-40b"  # sharded weights

So just to clarify: the current DLC does not support this model, only the 7B model?

Thank you.

I didn't get 7B working with the TGI container image 0.8.2 (763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04).

Building the latest TGI container image for SageMaker (GitHub - huggingface/text-generation-inference at v0.9.3) and following the other instructions I described above is what made 7B work for me.

I don't have time right now to train a 40B with my instructions.

If you have time, maybe you can try my instructions with 7B or 40B to validate them?

Am I correct in saying that the current DLC does not support tiiuae/falcon-40b-instruct deployment, as the model weights are not in safetensors format?
I have the following error when trying to deploy the pre-trained model on SageMaker:

safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 30, kind: ReadOnlyFilesystem, message: "Read-only file system" })

I see that the workaround suggested above is to convert the PyTorch model weights to safetensors format during training, but what is the current workaround for deploying as is?

That's not correct; the DLC supports deploying Falcon, see: Deploy Falcon 7B & 40B on Amazon SageMaker.
But to make things easier, having your weights in safetensors decreases the startup time.
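
For reference, the "convert during training" workaround mentioned above essentially means saving the final model with safe serialization, so there is nothing left for TGI to convert at startup. A minimal sketch, assuming a transformers model that has already been fine-tuned and merged (the model name and output path are illustrative, not from the thread):

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative names/paths: replace with your fine-tuned (merged) checkpoint
# and your training job's output directory.
model = AutoModelForCausalLM.from_pretrained("my-finetuned-falcon")
tokenizer = AutoTokenizer.from_pretrained("my-finetuned-falcon")

output_dir = "/opt/ml/model"  # where SageMaker expects final artifacts
os.makedirs(output_dir, exist_ok=True)

# safe_serialization=True writes model.safetensors instead of pytorch_model.bin,
# so TGI has nothing to convert (and nothing to write) when the endpoint starts.
model.save_pretrained(output_dir, safe_serialization=True)
tokenizer.save_pretrained(output_dir)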

The issue I am facing is that I am trying to deploy the model on SageMaker within a VPC (with no access to the public internet). When deploying, I am unable to download the model from an S3 bucket to /opt/ml/model, as the filesystem is read-only; therefore, I am unable to convert the PyTorch model weights to safetensors format. Can I deploy the model as is (i.e. without converting the weights during training, as suggested)?

Note: when I say "as is", the model.tar.gz file looks like this…

@philschmid Any solution to this? I am facing the same issue when deploying a fine-tuned L2-70B (GPTQ-quantized) on g5.48xlarge.

Here’s the repo:
shekharchatterjee/temp-model-174 · Hugging Face

Error: DownloadError
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
    utils.convert_files(local_pt_files, local_st_files)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
    convert_file(pt_file, sf_file)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 52, in convert_file
    pt_state = torch.load(pt_file, map_location="cpu")
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
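
For context: as far as I know, TGI's download step only falls back to this convert-on-download path (the torch.load in the traceback) when the repo contains no .safetensors files. A quick pre-flight check can tell you whether conversion will be triggered; a small sketch using huggingface_hub, with the repo id taken from the post above:

from huggingface_hub import list_repo_files

repo_id = "shekharchatterjee/temp-model-174"  # repo from the post above
files = list_repo_files(repo_id)  # may need token= if the repo is private

# TGI only converts .bin/.pt weights when no .safetensors shards are present,
# which is what triggers the torch.load seen in the traceback.
has_safetensors = any(f.endswith(".safetensors") for f in files)
print("safetensors present:", has_safetensors)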

I am stuck in the same situation. Did you get it solved?

Yes, I have since converted the PyTorch version of the model to safetensors format using this:

GitHub - Silver267/pytorch-to-safetensor-converter: A simple converter which converts pytorch bin files to safetensor, intended to be used for LLM conversion.
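
For anyone who hits this later: that conversion essentially boils down to loading the pickled state dict and re-saving it with safetensors. A minimal sketch of the idea (not the repo's exact code; the filenames are illustrative):

import torch
from safetensors.torch import save_file

# Load the pickled state dict on CPU. This is the torch.load step that fails
# inside the read-only container, which is why it has to happen offline.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# safetensors requires contiguous tensors; tied/shared weights may also need
# to be de-duplicated (e.g. cloned) before saving.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}

save_file(state_dict, "model.safetensors", metadata={"format": "pt"})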

@philschmid @Jorgeutd
Hi guys, any solution for this issue?
I am facing the same issue when trying to deploy Mistral 7B. Training completes successfully, but the deployment gives this error: raise RuntimeError(f"weight {tensor_name} does not exist")

Here is what I am using:

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

llm_image_uri_ver = "1.3.1"
llm_image = get_huggingface_llm_image_uri(
    "huggingface",  # backend: huggingface or lmi
    version=llm_image_uri_ver,
    session=Sagemaker_Session,
    region=region_name,
)

config = {
    'HF_MODEL_ID': "/opt/ml/model",  # load weights from model_data instead of the Hub
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(MAX_INPUT_LENGTH),  # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(MAX_TOTAL_TOKENS),  # max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(MAX_BATCH_TOTAL_TOKENS),  # limits the number of tokens processed in parallel during generation
    'MAX_BATCH_PREFILL_TOKENS': json.dumps(MAX_BATCH_PREFILL_TOKENS),
    'HUGGING_FACE_HUB_TOKEN': HUGGING_FACE_HUB_TOKEN,
    'HF_TASK': "text-classification",
}

llm_model = HuggingFaceModel(
    role=my_role,
    image_uri=llm_image,
    env=config,
    sagemaker_session=Sagemaker_Session,
    model_data=s3_train_model_path,
)
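
For completeness, deploying and invoking that model object would look something like the sketch below; the instance type and health-check timeout are assumptions, not values from the original post:

# Deploy the model to a real-time endpoint.
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",               # assumption: size to your model
    container_startup_health_check_timeout=600,  # TGI needs time to load weights
)

# TGI endpoints expect a text-generation style payload.
response = llm.predict({
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"max_new_tokens": 128},
})
print(response)

One thing worth double-checking: the TGI container serves text generation, and as far as I know HF_TASK is read by the plain Hugging Face inference DLCs rather than by TGI, so the "text-classification" entry in the config above is likely ignored here.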