Training Spot Instance: meaning of parameters max_run and max_wait

Hi,

In the notebook sagemaker >> 05_spot_instances >> sagemaker-notebook.ipynb, we need to set up the parameters max_run and max_wait, as described in the Hugging Face doc Spot instances.

However, there is no explanation about their meaning.

I searched for more information in the AWS SageMaker doc and found the parameters MaxRuntimeInSeconds and MaxWaitTimeInSeconds:

  • MaxRuntimeInSeconds: The maximum length of time, in seconds, that a training or compilation job can run.
  • MaxWaitTimeInSeconds: The maximum length of time, in seconds, that a managed Spot training job has to complete. It is the amount of time spent waiting for Spot capacity plus the amount of time the job can run. It must be equal to or greater than MaxRuntimeInSeconds. If the job does not complete during this time, Amazon SageMaker ends the job.

I guess that:

  • max_run = MaxRuntimeInSeconds
  • max_wait = MaxWaitTimeInSeconds
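
To make my guess concrete, here is a minimal sketch of how I understand the two values to relate. It assumes the mapping above is right (max_run = MaxRuntimeInSeconds, max_wait = MaxWaitTimeInSeconds). The helper name spot_stopping_condition and the values 3600 and 7200 are hypothetical, only for illustration; this is not code from the notebook or from the SageMaker SDK.

```python
def spot_stopping_condition(max_run: int, max_wait: int) -> dict:
    """Hypothetical helper: build the stopping-condition fields from max_run/max_wait.

    Assumes max_run maps to MaxRuntimeInSeconds and max_wait to
    MaxWaitTimeInSeconds, which is my guess, not a confirmed fact.
    """
    if max_run <= 0:
        raise ValueError("max_run must be a positive number of seconds")
    # The AWS doc says MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds.
    if max_wait < max_run:
        raise ValueError("max_wait must be equal to or greater than max_run")
    return {
        "MaxRuntimeInSeconds": max_run,
        "MaxWaitTimeInSeconds": max_wait,
    }


# Illustrative values: 1 hour of run time, 2 hours in total.
condition = spot_stopping_condition(max_run=3600, max_wait=7200)

# If the job uses its full run time, this is what remains for waiting on Spot capacity.
spare_wait_seconds = condition["MaxWaitTimeInSeconds"] - condition["MaxRuntimeInSeconds"]

print(condition)
print(spare_wait_seconds)
```

If this reading is correct, the same two numbers would be passed as max_run and max_wait when creating the estimator in the notebook.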

max_run

My question about max_run: when training a Hugging Face model with Trainer(), what counts as a “training or compilation job”?

  1. Is it the whole training job, from the first step to the final one (i.e., all the epochs)?
  2. Or is it each training segment between checkpoints (per epoch or per defined number of steps), with the evaluation at each checkpoint counted as a separate job?

max_wait

My question about max_wait: when training a Hugging Face model with Trainer(), what counts as a “managed Spot training job”?

  1. (same question 1 as for max_run) whole training job?
  2. (same question 2 as for max_run) training job by checkpoint?

cc @philschmid