ClientError: Artifact upload failed:Error 5

After 0.07 epochs the training job stops and gives the following error:

[INFO|trainer.py:1885] 2021-11-19 18:43:48,502 >> Saving model checkpoint to /opt/ml/model/checkpoint-3500
[INFO|configuration_utils.py:351] 2021-11-19 18:43:48,503 >> Configuration saved in /opt/ml/model/checkpoint-3500/config.json
Downloading: 28.8kB [00:00, 16.0MB/s]
Downloading: 28.7kB [00:00, 17.6MB/s]

2021-11-19 18:44:10 Uploading - Uploading generated training model
2021-11-19 18:44:10 Failed - Training job failed
ProfilerReport-1637341687: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-13-6a9d8eb3a402> in <module>
     33 
     34 # starting the train job
---> 35 huggingface_estimator.fit()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    690         self.jobs.append(self.latest_training_job)
    691         if wait:
--> 692             self.latest_training_job.wait(logs=logs)
    693 
    694     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1650         # If logs are requested, call logs_for_jobs.
   1651         if logs != "None":
-> 1652             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1653         else:
   1654             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3776 
   3777         if wait:
-> 3778             self._check_job_status(job_name, description, "TrainingJobStatus")
   3779             if dot:
   3780                 print()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3333                 ),
   3334                 allowed_statuses=["Completed", "Stopped"],
-> 3335                 actual_status=status,
   3336             )
   3337 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-11-19-17-08-07-355: Failed. Reason: ClientError: Artifact upload failed:Error 5: Received a failed archive status.

Thank you for your help. :smile:

What did you save to /opt/ml/model?

The model, with its checkpoints as folders.

Do you know how big it was? This normally works.

Not really, but it can easily go to 100 GB plus.

You could use checkpointing with SageMaker, which automatically syncs checkpoints to S3 as they are created, and only save the final model (without checkpoints) to /opt/ml/model at the end. That keeps the end-of-job artifact small. A sketch follows below.
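In case it helps, here is a minimal sketch of what that could look like with the HuggingFace estimator. The bucket name, entry point, container versions and hyperparameters are placeholders, not taken from your setup; the key parts are checkpoint_s3_uri / checkpoint_local_path and pointing the Trainer's output_dir at the checkpoint path instead of /opt/ml/model:

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="train.py",           # placeholder training script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.12",      # illustrative versions, pick ones matching your DLC
    pytorch_version="1.9",
    py_version="py38",
    # Checkpoints written to checkpoint_local_path are synced to S3 as they
    # are created, instead of being tarred into the final model artifact.
    checkpoint_s3_uri="s3://your-bucket/checkpoints",  # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={
        "epochs": 3,
        # Have the Trainer write its checkpoints to the checkpoint path,
        # and save only the final model to /opt/ml/model in your script.
        "output_dir": "/opt/ml/checkpoints",
    },
)

huggingface_estimator.fit()

With this setup the archive uploaded from /opt/ml/model at the end of the job only contains the final model, which avoids trying to tar and upload 100 GB+ of checkpoints.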


OK, thanks.