Training Spot Instance: meaning of parameters max_run and max_wait

Yes, good points :)

The max duration of a SageMaker Training job is 5 days, so you could set max_run to that if you want to run for as long as possible. If you want to run something longer, you can try chaining together multiple checkpointed training jobs (I haven’t seen it done yet, though), for example using a Lambda function that launches job N+1 when triggered by the notification emitted when job N stops.
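To make the chaining idea a bit more concrete, here is a rough sketch of the decision logic such a Lambda could contain. It assumes the Lambda is triggered by the EventBridge "SageMaker Training Job State Change" event, whose `detail` carries `TrainingJobName`, `TrainingJobStatus` and `SecondaryStatus`. The `-partN` naming scheme and the function names are made up for illustration, and the actual launch of the next job (e.g. through boto3) is left out on purpose.

```python
import re


def should_chain(detail: dict) -> bool:
    """Return True only when the job stopped because it hit max_run.

    Assumption: a job that exceeds its max runtime is reported as
    TrainingJobStatus "Stopped" with SecondaryStatus "MaxRuntimeExceeded".
    """
    return (
        detail.get("TrainingJobStatus") == "Stopped"
        and detail.get("SecondaryStatus") == "MaxRuntimeExceeded"
    )


def next_job_name(job_name: str) -> str:
    """Derive the name of job N+1 from job N (hypothetical '-partN' suffix)."""
    match = re.fullmatch(r"(.*-part)(\d+)", job_name)
    if match is None:
        # First job in the chain had no suffix, so the follow-up is part 2.
        return f"{job_name}-part2"
    prefix, index = match.groups()
    return f"{prefix}{int(index) + 1}"


def handler(event, context=None):
    """Lambda-style entry point: return the name of the job to launch, or None."""
    detail = event.get("detail", {})
    if not should_chain(detail):
        return None
    # A real Lambda would start the next training job here, pointing it at
    # the same checkpoint location so it resumes where job N left off.
    return next_job_name(detail["TrainingJobName"])
```

Job N+1 only makes progress if the training script itself reloads the latest checkpoint on startup, so that part still has to be handled in your training code.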

You can pick a big enough max_run (less than 5 days, though). Or, if you checkpoint your training state, you can restart it later from another training job, so that interruptions don’t really matter.
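As a minimal sketch of how these settings fit together: `max_run` caps the training time itself, while `max_wait` caps the total time including waiting for Spot capacity, so it has to be at least as large as `max_run`. The helper below only builds the keyword arguments; the helper name, the hour-based inputs and the bucket path are hypothetical, and the 5-day cap is simply the limit mentioned in this thread.

```python
FIVE_DAYS_IN_SECONDS = 5 * 24 * 3600  # limit quoted in this thread


def spot_training_kwargs(max_run_hours: float,
                         extra_wait_hours: float,
                         checkpoint_s3_uri: str) -> dict:
    """Build Spot-related keyword arguments for a SageMaker estimator.

    max_wait = max_run + extra time allowed for waiting on Spot capacity,
    which guarantees max_wait >= max_run.
    """
    if extra_wait_hours < 0:
        raise ValueError("extra_wait_hours must be non-negative")
    if not checkpoint_s3_uri.startswith("s3://"):
        raise ValueError("checkpoint_s3_uri must be an s3:// URI")

    max_run = int(max_run_hours * 3600)
    if not 0 < max_run <= FIVE_DAYS_IN_SECONDS:
        raise ValueError("max_run must be positive and at most 5 days")

    return {
        "use_spot_instances": True,
        "max_run": max_run,                                   # seconds of training
        "max_wait": max_run + int(extra_wait_hours * 3600),   # training + waiting
        "checkpoint_s3_uri": checkpoint_s3_uri,               # where checkpoints are synced
    }


# Example with a hypothetical bucket: 2 days of training, up to 12 extra hours of waiting.
kwargs = spot_training_kwargs(48, 12, "s3://my-bucket/checkpoints/run-1")
# These can then be passed to an estimator, e.g. HuggingFace(..., **kwargs).
```

With `checkpoint_s3_uri` set, checkpoints your script writes to the local checkpoint directory are synced to S3, so a later job pointing at the same URI can pick them up, provided the script loads them on startup.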

Also note that if you train things for multiple days on one p3.2xlarge and want to go faster, it’s worth considering data-parallel training (example here: https://huggingface.co/blog/sagemaker-distributed-training-seq2seq).
