Training Spot Instance: meaning of parameters max_run and max_wait

pierreguillou · January 5, 2022, 7:44pm

Hi,

In the notebook sagemaker >> 05_spot_instances >> sagemaker-notebook.ipynb, we need to set the parameters max_run and max_wait, as described in the Hugging Face doc on Spot instances.

However, the doc does not explain what they mean.

I searched for more information in the AWS SageMaker doc and found the parameters MaxRuntimeInSeconds and MaxWaitTimeInSeconds:

MaxRuntimeInSeconds: The maximum length of time, in seconds, that a training or compilation job can run.
MaxWaitTimeInSeconds: The maximum length of time, in seconds, that a managed Spot training job has to complete. It is the amount of time spent waiting for Spot capacity plus the amount of time the job can run. It must be equal to or greater than MaxRuntimeInSeconds. If the job does not complete during this time, Amazon SageMaker ends the job.
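To make the constraint between the two values concrete, here is a minimal sketch in plain Python. The helper name spot_stopping_condition is hypothetical (it is not part of the SageMaker SDK), and it assumes that the SDK's max_run and max_wait correspond to MaxRuntimeInSeconds and MaxWaitTimeInSeconds in the job's stopping condition. It only checks the rule quoted above, that max_wait must be equal to or greater than max_run.

```python
def spot_stopping_condition(max_run: int, max_wait: int) -> dict:
    """Build a stopping-condition dict from max_run and max_wait (in seconds).

    Hypothetical helper for illustration only, not a SageMaker SDK function.
    Assumes max_run -> MaxRuntimeInSeconds and max_wait -> MaxWaitTimeInSeconds.
    """
    if max_run <= 0:
        raise ValueError("max_run must be a positive number of seconds")
    # Rule from the AWS doc: MaxWaitTimeInSeconds >= MaxRuntimeInSeconds.
    if max_wait < max_run:
        raise ValueError("max_wait must be equal to or greater than max_run")
    return {
        "MaxRuntimeInSeconds": max_run,
        "MaxWaitTimeInSeconds": max_wait,
    }


# Example: allow up to 1 hour of running time, and up to 2 hours in total
# (time spent waiting for Spot capacity plus running time).
condition = spot_stopping_condition(max_run=3600, max_wait=7200)
print(condition)
```

In the notebook, the same two values would be passed to the estimator together with use_spot_instances=True, for example HuggingFace(..., use_spot_instances=True, max_run=3600, max_wait=7200).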

I guess that:

max_run

My question about max_run: when training a Hugging Face model with Trainer(), what counts as a “training or compilation job”?

Is it the whole training job from the first step to the final one (i.e., all the epochs)?
Or is it the training between two checkpoints (per epoch or per defined number of steps), with the evaluation at each checkpoint counted as a separate job?

My question about max_wait: when training a Hugging Face model with Trainer(), what counts as a “managed Spot training job”?
