I have trained Falcon 40B with QLoRA as a SageMaker training job, based on this tutorial: Train LLMs using QLoRA on Amazon SageMaker.
The training is now reported as complete, but the job has been stuck at model upload for more than an hour.
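For reference, my launch code follows the tutorial closely; a simplified sketch is below (the entry point, framework versions, instance type, S3 paths, and hyperparameters are placeholders and may differ from the tutorial):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hyperparameters passed to the training script (names and values are
# placeholders; they depend on the script from the tutorial).
hyperparameters = {
    "model_id": "tiiuae/falcon-40b",
    "epochs": 3,
    "lr": 2e-4,
}

huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",        # training script name is a placeholder
    source_dir="./scripts",
    instance_type="ml.g5.12xlarge",  # placeholder instance type
    instance_count=1,
    role=role,
    transformers_version="4.28",     # framework versions as in the tutorial
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters=hyperparameters,
    sagemaker_session=sess,
)

# Start the training job; the data channel S3 URI is a placeholder.
huggingface_estimator.fit({"training": "s3://my-bucket/falcon-qlora/data"})
```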
These are the log messages:
2023-09-21 10:47:09,537 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2023-09-21 10:47:09,537 sagemaker-training-toolkit INFO Reporting training SUCCESS
2023-09-21 10:47:14 Uploading - Uploading generated training model
I checked the S3 path where the training job contents are stored, but no model artifacts are present there.
Could you please suggest how to access the model artifacts that are saved after the training process?
Uploading the model at the end of the training job can take multiple hours, depending on the model size and whether you used compressed upload.
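If you want to watch the upload phase and find where the artifacts will land without logging into the training instance, you can query the job from your notebook, for example with boto3 (the training job name below is a placeholder):

```python
import boto3

sm = boto3.client("sagemaker")

# Replace with the actual training job name from the SageMaker console.
desc = sm.describe_training_job(TrainingJobName="falcon-40b-qlora-2023-09-21")

print(desc["TrainingJobStatus"])   # e.g. InProgress / Completed
print(desc["SecondaryStatus"])     # shows "Uploading" while artifacts are written to S3

# Once the job is Completed, this is the S3 URI of the model.tar.gz artifact.
print(desc["ModelArtifacts"]["S3ModelArtifacts"])
```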
Thank you for the quick reply; I'll wait and check accordingly. Is there any way to log in to the training job machine and confirm whether the model was created?
Thank you for the solution. I am now able to access the model artifacts in S3. After that, I started deploying based on this tutorial: Deploy Falcon 7B & 40B on Amazon SageMaker, and endpoint creation is failing.
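My deployment call follows that tutorial; a simplified sketch is below (the S3 path, container version, instance type, and environment values are placeholders):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face TGI (LLM) container; version is a placeholder taken from the tutorial.
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    # Placeholder for the model.tar.gz produced by my training job.
    model_data="s3://my-bucket/falcon-qlora/output/model.tar.gz",
    env={
        "HF_MODEL_ID": "/opt/ml/model",       # load weights from the unpacked model_data
        "SM_NUM_GPUS": "4",                   # GPUs available on the endpoint instance
        "HF_MODEL_QUANTIZE": "bitsandbytes",  # optional quantization setting
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=600,
)
```

This is the error I am getting: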
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 209, in get_model
    return FlashRWSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 161, in __init__
    model=model.to(device),
File "/usr/src/transformers/src/transformers/modeling_utils.py", line 1903, in to
    return super().to(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
rank=0
Could you please suggest a way to debug this error?