Training Spot Instance: meaning of parameters max_run and max_wait

OlivierCR · January 5, 2022, 11:41pm

Yes, good points.

The maximum duration of a SageMaker Training job is 5 days, so you could set max_run to that if you want to run for as long as possible. If you need to run something longer, you can try chaining together multiple checkpointed training jobs (I haven’t seen it done yet, though), for example with a Lambda that launches job N+1 when triggered by the notification emitted when job N stops.

You can pick a big enough max_run (less than 5 days, though). Or, if you checkpoint your training state, you can resume it later from another training job, so interruptions don’t really matter.
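As a small illustration of how these settings fit together, the sketch below builds the spot-related keyword arguments you would pass to a SageMaker Python SDK estimator (for example `HuggingFace(...)`). The helper name `spot_training_kwargs` and the bucket path are invented for the example; `use_spot_instances`, `max_run`, `max_wait` and `checkpoint_s3_uri` are the estimator parameters, with both durations in seconds. `max_wait` covers waiting for spot capacity plus the run itself, so it has to be at least as large as `max_run`.

```python
FIVE_DAYS_SECONDS = 5 * 24 * 3600  # training job limit mentioned in this thread


def spot_training_kwargs(max_run_hours, extra_wait_hours, checkpoint_s3_uri):
    """Return spot-related estimator kwargs (hypothetical helper, not part of the SDK)."""
    max_run = int(max_run_hours * 3600)
    if max_run > FIVE_DAYS_SECONDS:
        raise ValueError("max_run exceeds the 5-day training job limit")
    # max_wait = time allowed for the run + time spent waiting for spot capacity
    max_wait = max_run + int(extra_wait_hours * 3600)
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
        # Checkpoints are synced to this S3 location, so a later job can resume.
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


# Example: up to 24h of training, plus up to 12h waiting for spot capacity.
kwargs = spot_training_kwargs(24, 12, "s3://my-bucket/checkpoints")  # placeholder bucket
# These would then be passed along with the usual arguments, e.g.
# HuggingFace(entry_point=..., role=..., instance_type=..., **kwargs)
```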

Also note that if you train for multiple days on a single p3.2xlarge and want to go faster, it’s worth considering data-parallel training (example here: https://huggingface.co/blog/sagemaker-distributed-training-seq2seq).
