Deploying Fine-Tuned Falcon 40B with QLoRA on SageMaker: Inference Error

@malterei My issue was with the Falcon model:

model_id = "tiiuae/falcon-40b"  # sharded weights

So just to clarify: the current DLC does not support this model, only the 7B one?

Thank you.

I didn’t get 7b working with TGI container image 0.8.2 (763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04).

Building the latest TGI container image for SageMaker (GitHub - huggingface/text-generation-inference at v0.9.3) and following the other instructions I described above is what made 7B work for me.
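
In case it helps, the deploy step with the custom-built image looks roughly like this. This is only a minimal sketch: the ECR account, repository name, and tag are placeholders, and role / s3_model_path are assumed to be defined elsewhere.

from sagemaker.huggingface import HuggingFaceModel

# Placeholder URI for the TGI v0.9.3 image you built and pushed to your own ECR.
custom_image = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/text-generation-inference:0.9.3"

llm_model = HuggingFaceModel(
    role=role,                    # your SageMaker execution role
    image_uri=custom_image,       # custom-built TGI image instead of the stock DLC
    model_data=s3_model_path,     # model.tar.gz with the fine-tuned weights
    env={
        "HF_MODEL_ID": "/opt/ml/model",  # load weights from the unpacked model_data
        "SM_NUM_GPUS": "1",
    },
)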

I don’t have the time right now to train a 40b with my instructions.

If you have time, maybe you can try my instructions with 7B or 40B to validate them?

Am I correct in saying that the current DLC does not support tiiuae/falcon-40b-instruct deployment, as the model weights are not in safetensors format?
I get the following error when trying to deploy the pre-trained model on SageMaker:

safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 30, kind: ReadOnlyFilesystem, message: "Read-only file system" })

I see that the workaround suggested above is to convert the PyTorch model weights to safetensors format during training, but what is the current workaround for deploying the model as is?

That’s not correct; the DLC does support deploying Falcon, see: Deploy Falcon 7B & 40B on Amazon SageMaker.
But to make things easier, having your weights in safetensors format decreases the startup time.
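
For example, a minimal sketch of doing that at the end of a training script (recent Transformers versions support safe_serialization in save_pretrained; the path is illustrative):

from transformers import AutoModelForCausalLM

# Load the merged fine-tuned model from the training output directory.
model = AutoModelForCausalLM.from_pretrained("/opt/ml/model")
# safe_serialization=True writes model.safetensors instead of pytorch_model.bin,
# so TGI can skip the .bin -> .safetensors conversion at endpoint startup.
model.save_pretrained("/opt/ml/model", safe_serialization=True)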

The issue I am facing is that I am trying to deploy the model on SageMaker within a VPC (with no access to the public internet). When deploying, I am unable to download the model from an S3 bucket to /opt/ml/model, as the filesystem is read-only; therefore I am unable to convert the PyTorch model weights to safetensors format. Can I deploy the model as is (i.e. without converting the weights during training as suggested)?

Note: when I say ‘as is’, the model.tar.gz file looks like this…

@philschmid Any solution to this? I am facing the same issue when deploying a fine-tuned L2-70b (Llama 2 70B), GPTQ-quantized, on g5.48xlarge. The traceback below shows TGI failing inside its convert step, while calling torch.load on the PyTorch checkpoints to turn them into safetensors.

Here’s the repo:
shekharchatterjee/temp-model-174 · Hugging Face

Error: DownloadError
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
    utils.convert_files(local_pt_files, local_st_files)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
    convert_file(pt_file, sf_file)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 52, in convert_file
    pt_state = torch.load(pt_file, map_location="cpu")
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
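
For reference, a GPTQ deployment would presumably need the quantization flag set explicitly so TGI loads the quantized weights instead of trying to convert full-precision ones. This is only a sketch; the HF_MODEL_QUANTIZE key follows the pattern in the Falcon deployment blog linked above, so treat it as an assumption:

import json

config = {
    "HF_MODEL_ID": "/opt/ml/model",  # or the Hub repo id above
    "SM_NUM_GPUS": json.dumps(8),    # g5.48xlarge has 8 GPUs
    "HF_MODEL_QUANTIZE": "gptq",     # assumed key; tells TGI to load GPTQ weights
}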

I am stuck in the same situation. Did you get it solved?

Yes, I have since converted the PyTorch version of the model to safetensors format using this:

GitHub - Silver267/pytorch-to-safetensor-converter: A simple converter which converts pytorch bin files to safetensor, intended to be used for LLM conversion.
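
For anyone who prefers to inline it, the core of that conversion is small (a sketch, assuming the checkpoint is a plain state dict):

import torch
from safetensors.torch import save_file

# Load the pickle-based checkpoint on CPU.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
# safetensors rejects tensors that share memory, so clone each tensor;
# contiguous() guards against non-contiguous views.
state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors", metadata={"format": "pt"})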

@philschmid @Jorgeutd
Hi guys, any solution for this issue?
I am facing the same issue when trying to deploy Mistral 7B. Training completes successfully, but the deployment gives this error: raise RuntimeError(f"weight {tensor_name} does not exist")

Here is what I am using:
llm_image_uri_ver = "1.3.1"
llm_image = get_huggingface_llm_image_uri(
    "huggingface",  # huggingface or lmi
    version=llm_image_uri_ver,
    session=Sagemaker_Session,
    region=region_name,
)
config = {
    "HF_MODEL_ID": "/opt/ml/model",  # model_id from Models - Hugging Face
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # Number of GPUs used per replica
    "MAX_INPUT_LENGTH": json.dumps(MAX_INPUT_LENGTH),  # Max length of input text
    "MAX_TOTAL_TOKENS": json.dumps(MAX_TOTAL_TOKENS),  # Max length of the generation (including input text)
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(MAX_BATCH_TOTAL_TOKENS),  # Limits the number of tokens that can be processed in parallel during generation
    "MAX_BATCH_PREFILL_TOKENS": json.dumps(MAX_BATCH_PREFILL_TOKENS),
    "HUGGING_FACE_HUB_TOKEN": HUGGING_FACE_HUB_TOKEN,
    "HF_TASK": "text-classification",
}
llm_model = HuggingFaceModel(
    role=my_role,
    image_uri=llm_image,
    env=config,
    sagemaker_session=Sagemaker_Session,
    model_data=s3_train_model_path,
)
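
For completeness, the deploy/invoke step that follows this model definition looks like this (instance type, timeout, and prompt below are placeholders):

# Deploy the model to a real-time endpoint.
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,  # give TGI time to load the weights
)

# Invoke with the TGI request schema.
response = llm.predict({
    "inputs": "Hello, world!",
    "parameters": {"max_new_tokens": 64},
})
print(response)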