I have trained Falcon 40B with QLoRA as a SageMaker training job, based on this tutorial: Train LLMs using QLoRA on Amazon SageMaker.
The training is now reported as complete, but the job has been stuck at model upload for more than an hour.
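For reference, my launch code follows the tutorial closely; a simplified sketch is below (the entry point, framework versions, instance type, S3 paths, and hyperparameters are placeholders and may differ from the tutorial):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hyperparameters passed to the training script (names and values are
# placeholders; they depend on the script from the tutorial).
hyperparameters = {
    "model_id": "tiiuae/falcon-40b",
    "epochs": 3,
    "lr": 2e-4,
}

huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",        # training script name is a placeholder
    source_dir="./scripts",
    instance_type="ml.g5.12xlarge",  # placeholder instance type
    instance_count=1,
    role=role,
    transformers_version="4.28",     # framework versions as in the tutorial
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters=hyperparameters,
    sagemaker_session=sess,
)

# Start the training job; the data channel S3 URI is a placeholder.
huggingface_estimator.fit({"training": "s3://my-bucket/falcon-qlora/data"})
```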
These are the log messages:
2023-09-21 10:47:09,537 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2023-09-21 10:47:09,537 sagemaker-training-toolkit INFO Reporting training SUCCESS
2023-09-21 10:47:14 Uploading - Uploading generated training model
I checked the S3 path where the training job contents are stored, but no model artifacts are present there.
Could you please suggest how to access the model artifacts that are saved after the training process?
Uploading the model at the end of the training job can take multiple hours, depending on the model size and whether you used compressed upload.
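If you want to watch the upload phase and find where the artifacts will land without logging into the training instance, you can query the job from your notebook, for example with boto3 (the training job name below is a placeholder):

```python
import boto3

sm = boto3.client("sagemaker")

# Replace with the actual training job name from the SageMaker console.
desc = sm.describe_training_job(TrainingJobName="falcon-40b-qlora-2023-09-21")

print(desc["TrainingJobStatus"])   # e.g. InProgress / Completed
print(desc["SecondaryStatus"])     # shows "Uploading" while artifacts are written to S3

# Once the job is Completed, this is the S3 URI of the model.tar.gz artifact.
print(desc["ModelArtifacts"]["S3ModelArtifacts"])
```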
Thank you for the quick reply; I'll wait and check accordingly. Is there any way to log in to the training job machine and confirm whether the model was created?
Thank you for the solution. I am now able to access the model artifacts in S3. After that, I started deploying based on this tutorial: Deploy Falcon 7B & 40B on Amazon SageMaker, and endpoint creation is failing.
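My deployment call follows that tutorial; a simplified sketch is below (the S3 path, container version, instance type, and environment values are placeholders):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face TGI (LLM) container; version is a placeholder taken from the tutorial.
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    # Placeholder for the model.tar.gz produced by my training job.
    model_data="s3://my-bucket/falcon-qlora/output/model.tar.gz",
    env={
        "HF_MODEL_ID": "/opt/ml/model",       # load weights from the unpacked model_data
        "SM_NUM_GPUS": "4",                   # GPUs available on the endpoint instance
        "HF_MODEL_QUANTIZE": "bitsandbytes",  # optional quantization setting
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=600,
)
```

This is the error I am getting: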
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 209, in get_model
    return FlashRWSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 161, in __init__
    model=model.to(device),
File "/usr/src/transformers/src/transformers/modeling_utils.py", line 1903, in to
    return super().to(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
rank=0
Could you please suggest a way to debug this error?