Hi,
In the notebook sagemaker >> 05_spot_instances >> sagemaker-notebook.ipynb, we need to set up the parameters max_run and max_wait, as described in the Hugging Face doc Spot instances.
However, there is no explanation of their meaning.
I searched for more information in the AWS SageMaker doc and found the parameters MaxRuntimeInSeconds and MaxWaitTimeInSeconds:
- MaxRuntimeInSeconds: The maximum length of time, in seconds, that a training or compilation job can run.
- MaxWaitTimeInSeconds: The maximum length of time, in seconds, that a managed Spot training job has to complete. It is the amount of time spent waiting for Spot capacity plus the amount of time the job can run. It must be equal to or greater than MaxRuntimeInSeconds. If the job does not complete during this time, Amazon SageMaker ends the job.
I guess that:
- max_run = MaxRuntimeInSeconds
- max_wait = MaxWaitTimeInSeconds
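To make this guess concrete, here is a minimal sketch of how the two values relate, assuming max_run and max_wait really do map onto MaxRuntimeInSeconds and MaxWaitTimeInSeconds. The helper name spot_training_kwargs and the values 3600 and 7200 are made up for illustration; only the "max_wait must be equal to or greater than max_run" rule comes from the AWS doc quoted above.

```python
def spot_training_kwargs(max_run: int, max_wait: int) -> dict:
    """Hypothetical helper: collect the Spot-related estimator arguments.

    Assumes max_run -> MaxRuntimeInSeconds and max_wait -> MaxWaitTimeInSeconds.
    """
    if max_run <= 0:
        raise ValueError("max_run must be a positive number of seconds")
    # Rule from the AWS doc: MaxWaitTimeInSeconds >= MaxRuntimeInSeconds
    if max_wait < max_run:
        raise ValueError("max_wait must be equal to or greater than max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,    # seconds the job itself may run
        "max_wait": max_wait,  # seconds for waiting on Spot capacity + running
    }


# Illustrative values only: 1 hour of run time, 2 hours in total.
kwargs = spot_training_kwargs(max_run=3600, max_wait=7200)

# If the job uses its full max_run, this is what is left for waiting on Spot capacity.
max_spot_wait = kwargs["max_wait"] - kwargs["max_run"]
print(kwargs, max_spot_wait)

# The dict would then be passed to the estimator (needs the sagemaker SDK and AWS access):
# from sagemaker.huggingface import HuggingFace
# estimator = HuggingFace(..., **kwargs)
```

This only illustrates the arithmetic between the two parameters; it does not answer what SageMaker counts as a "job", which is the question below.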
max_run
My question about max_run: in the case of training a Hugging Face model with Trainer(), what is a “training or compilation job”?
- Is it the whole training job from the first step to the final one (i.e., all the epochs)?
- Or is it the training job per checkpoint (per epoch or per the defined number of steps), with the evaluation at the end of each checkpoint counted as a separate job?
max_wait
My question about max_wait: in the case of training a Hugging Face model with Trainer(), what is a “managed Spot training job”?
- (same as question 1 for max_run) Is it the whole training job?
- (same as question 2 for max_run) Is it the training job per checkpoint?
cc @philschmid