Training on Sagemaker with Trainer() Instance

Hello everyone,

I wanted to train an NLP classifier on our server, but a single training run takes around 9 hours, so I wanted to move the training to SageMaker. When I simply copy my code with the Trainer() instance (trainer.train()), I get the following error:
ImportError: torch>=1.5.0 is required for a normal functioning of this module, but found torch==1.4.0.

And it looks like I cannot update my torch on Sagemaker to 1.5.0.

During my research I found that many training jobs are run with the HuggingFace estimator. Do I always have to use it and change my “local server code”?

How did you start your training? Which service did you use?

SageMaker has different options: there are Notebook instances, which are just hosted Jupyter services, and there is also the Training Platform, which uses the HuggingFace estimator as shown in all examples.
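For reference, launching a job on the Training Platform looks roughly like the sketch below. It assumes the SageMaker Python SDK is installed. The script name train.py, the ./scripts directory, the instance type, the hyperparameters and the version combination are placeholders for illustration, not values from this thread, so please check which transformers/PyTorch combinations your SDK version supports. The Trainer() code itself can usually stay mostly unchanged inside the entry-point script, since the training container ships its own torch and transformers.

```python
def build_estimator_kwargs(role, entry_point="train.py", source_dir="./scripts",
                           hyperparameters=None):
    """Collect the arguments for sagemaker.huggingface.HuggingFace.

    All default values here are illustrative assumptions.
    """
    return {
        "entry_point": entry_point,          # script that calls Trainer().train()
        "source_dir": source_dir,            # local folder uploaded to S3 with the job
        "role": role,                        # IAM role the training job runs under
        "instance_type": "ml.p3.2xlarge",    # assumed GPU instance type
        "instance_count": 1,
        "transformers_version": "4.6",       # assumed version combination
        "pytorch_version": "1.7",
        "py_version": "py36",
        # Hyperparameters are passed to the script as command-line arguments.
        "hyperparameters": hyperparameters or {"epochs": 1},
    }


def launch_training(role, train_s3_uri):
    """Create the estimator and start the job (requires AWS credentials)."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.huggingface import HuggingFace

    estimator = HuggingFace(**build_estimator_kwargs(role))
    # The "train" channel is exposed to the script via SM_CHANNEL_TRAIN.
    estimator.fit({"train": train_s3_uri})
    return estimator


# Example call from a SageMaker notebook (bucket and prefix are hypothetical):
# import sagemaker
# launch_training(sagemaker.get_execution_role(), "s3://my-bucket/email-data/train")
```

Keeping the arguments in a plain dictionary makes it easy to print and double-check them before a job is actually started.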

Well, I am using a Jupyter notebook in SageMaker with this kernel:

4 vCPU + 16 GiB + 1 GPU Python 3 (PyTorch 1.4 Python 3.6 GPU Optimized)

I install all needed packages, such as torch and transformers, in the first line of code. If I go my usual way (creating a dataset object, defining the model, and training with Trainer()), it raises the error I described above.

Meanwhile, I tried to work with the Hugging Face platform and followed your tutorial:

For that, I converted my dataset to .json, but I get this error:

UnexpectedStatusException: Error for Training job operations-email-classification-2021-11-2021-11-03-14-40-49-703: Failed. Reason: AlgorithmError: framework error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/trainer.py", line 85, in train
    entrypoint()
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 121, in main
    train(environment.Environment())
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 73, in train
    runner_type=runner_type)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/entry_point.py", line 92, in run
    files.download_and_extract(uri=uri, path=environment.code_dir)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 131, in download_and_extract
    s3_download(uri, dst)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 167, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
Extr

Is this the full error stack you receive? It looks like something is missing.

This is what I am receiving when using the Huggingface estimator:

2021-11-03 15:08:16 Starting - Starting the training job...
2021-11-03 15:08:42 Starting - Launching requested ML instancesProfilerReport-1635952096: InProgress
...
2021-11-03 15:09:13 Starting - Preparing the instances for training............
2021-11-03 15:11:02 Downloading - Downloading input data
2021-11-03 15:11:02 Training - Downloading the training image..................
2021-11-03 15:14:16 Uploading - Uploading generated training modelbash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-11-03 15:14:11,348 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-11-03 15:14:11,372 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-11-03 15:14:12,797 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-11-03 15:14:13,102 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2021-11-03 15:14:13,102 sagemaker-training-toolkit ERROR    framework error: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/trainer.py", line 85, in train
    entrypoint()
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 121, in main
    train(environment.Environment())
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 73, in train
    runner_type=runner_type)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/entry_point.py", line 92, in run
    files.download_and_extract(uri=uri, path=environment.code_dir)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 131, in download_and_extract
    s3_download(uri, dst)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 167, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 172, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/download.py", line 343, in _submit
    **transfer_future.meta.call_args.extra_args
  File "/opt/conda/lib/python3.6/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

An error occurred (404) when calling the HeadObject operation: Not Found

2021-11-03 15:14:45 Failed - Training job failed
ProfilerReport-1635952096: Stopping
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-118-7a4d3426c635> in <module>
      6 
      7 # starting the train job with our uploaded datasets as input
----> 8 huggingface_estimator.fit(training_data, wait=True)

/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    690         self.jobs.append(self.latest_training_job)
    691         if wait:
--> 692             self.latest_training_job.wait(logs=logs)
    693 
    694     def _compilation_job_name(self):

/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1650         # If logs are requested, call logs_for_jobs.
   1651         if logs != "None":
-> 1652             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1653         else:
   1654             self.sagemaker_session.wait_for_job(self.job_name)

/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3776 
   3777         if wait:
-> 3778             self._check_job_status(job_name, description, "TrainingJobStatus")
   3779             if dot:
   3780                 print()

/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3333                 ),
   3334                 allowed_statuses=["Completed", "Stopped"],
-> 3335                 actual_status=status,
   3336             )
   3337 

UnexpectedStatusException: Error for Training job operations-email-classification-2021-11-2021-11-03-15-08-16-273: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/trainer.py", line 85, in train
    entrypoint()
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 121, in main
    train(environment.Environment())
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 73, in train
    runner_type=runner_type)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/entry_point.py", line 92, in run
    files.download_and_extract(uri=uri, path=environment.code_dir)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 131, in download_and_extract
    s3_download(uri, dst)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 167, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
    Extr

Whatever you do in your script, it cannot download the file from S3, so either you don’t have permission or the file doesn’t exist. Can you share your script?
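To tell those two cases apart, you could issue the same HeadObject call that fails in the traceback yourself. This is a minimal sketch, assuming boto3 is available in the notebook. The helper names and the example URI are hypothetical, and the error is inspected through the `response` dictionary that botocore's ClientError carries rather than by importing the exception class.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % (uri,))
    return parsed.netloc, parsed.path.lstrip("/")


def diagnose_s3_object(s3_client, uri):
    """Return 'ok', 'missing' or 'forbidden' for the object behind `uri`.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    bucket, key = split_s3_uri(uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
    except Exception as err:
        # botocore's ClientError stores the service answer in `err.response`.
        response = getattr(err, "response", None)
        if not isinstance(response, dict):
            raise
        code = str(response.get("Error", {}).get("Code", ""))
        if code in ("404", "NoSuchKey", "NotFound"):
            return "missing"
        if code in ("403", "AccessDenied"):
            return "forbidden"
        raise
    return "ok"


# Example call (the URI is a placeholder, not a path from this thread):
# import boto3
# print(diagnose_s3_object(boto3.client("s3"), "s3://my-bucket/job-name/source/sourcedir.tar.gz"))
```

Note that S3 can answer 403 instead of 404 for a missing key when the caller lacks s3:ListBucket on the bucket, so "forbidden" alone does not prove that the object exists.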

I think I don’t have permission to download or upload any files. I am not sure whether I can post the material here. I’d prefer to send it to you via PM and share the results here later, if that is okay?