Training Spot Instance: meaning of parameters max_run and max_wait

See the documentation here Estimators — sagemaker 2.72.1 documentation - sorry if it’s hard to find, I’ll circulate the feedback on our side

Both max_run and max_wait are SageMaker Training parameters; they have no connection with Hugging Face. They control how your job (Hugging Face or other code) behaves when using Spot capacity (use_spot_instances=True).

EC2 Spot is spare compute capacity available at a discount. Read more here: Amazon EC2 Spot – Save up-to 90% on On-Demand Prices
The 2 most important things to be aware of with Spot are:

  1. Capacity is variable, because it is based on spare compute. It’s better to use it opportunistically and be ready to fall back to on-demand (or be ready to wait for availability) than to expect to be able to use any amount of Spot at any time
  2. Spot capacity can be reclaimed. Workloads running on Spot should either be quick enough not to worry about interruptions, or have interruption-handling mechanisms built-in

In SageMaker, you can decide to run your job on EC2 Spot capacity, and get up to 90% savings vs the public on-demand price (in my personal experience, it’s often around 70%, as reported here).
The SageMaker Spot is called “Managed Spot”, because it is easier to use than raw EC2 Spot:

  • You just need to specify 3 parameters (use_spot_instances, max_wait, max_run)
  • SageMaker has a checkpoint feature: you can read and write from the checkpoint location to persist and recover state across Spot interruptions (anything written to the local checkpoint location is synced to S3, and a fresh job loads the content of the S3 checkpoint location into the local checkpoint location). Note, however, that it is your responsibility to know what to save and how to read it back so that interruptions are handled gracefully. SageMaker Checkpoint just syncs your files with S3 and reloads them post-interruption; it is not aware of how those files should be handled. That should be done by your code.
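To illustrate the "your code handles the files" part, here is a minimal sketch of a training loop that resumes from a checkpoint. It assumes the default local checkpoint directory /opt/ml/checkpoints; the file name state.json, the CHECKPOINT_DIR environment variable and the train/load_state/save_state helpers are hypothetical names made up for this example, not part of SageMaker.

```python
import json
import os
import tempfile

# Assumption: SageMaker's default local checkpoint directory.
# CHECKPOINT_DIR is a hypothetical override for running this outside SageMaker.
CHECKPOINT_DIR = os.environ.get("CHECKPOINT_DIR", "/opt/ml/checkpoints")
STATE_FILE = "state.json"  # hypothetical file name


def load_state(checkpoint_dir=CHECKPOINT_DIR):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, STATE_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def save_state(state, checkpoint_dir=CHECKPOINT_DIR):
    """Write to a temp file, then rename, so state.json is never half-written."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(checkpoint_dir, STATE_FILE))


def train(total_epochs, checkpoint_dir=CHECKPOINT_DIR):
    """Resume from the saved epoch (if any) and checkpoint after every epoch."""
    state = load_state(checkpoint_dir)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would go here ...
        state["epoch"] = epoch + 1
        save_state(state, checkpoint_dir)
    return state
```

In a real job you would also save model and optimizer state next to the epoch counter. The point is only that the script checks for an existing checkpoint at startup, because after a Spot interruption the job starts again from the top.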

In particular:

  • max_run is the max cumulative duration of the SageMaker Training job, in seconds
  • max_wait is the max time, in seconds, you are going to wait for your job to complete (max_run + Spot waiting time)

And max_wait must be equal to or larger than max_run.
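As a rough sketch of how the parameters fit together, the helper below builds the Spot-related keyword arguments and checks the max_wait vs max_run constraint before you launch anything. The helper names spot_training_kwargs and spot_wait_budget are hypothetical; use_spot_instances, max_run, max_wait and checkpoint_s3_uri are the estimator parameters, and the check assumes max_wait equal to max_run is accepted.

```python
def spot_training_kwargs(max_run, max_wait, checkpoint_s3_uri=None):
    """Build estimator keyword arguments for Managed Spot; limits are in seconds."""
    if max_run <= 0:
        raise ValueError("max_run must be positive")
    if max_wait < max_run:
        raise ValueError("max_wait must be at least max_run")
    kwargs = {"use_spot_instances": True, "max_run": max_run, "max_wait": max_wait}
    if checkpoint_s3_uri is not None:
        kwargs["checkpoint_s3_uri"] = checkpoint_s3_uri
    return kwargs


def spot_wait_budget(max_run, max_wait):
    """Seconds left for waiting on Spot capacity if the job uses its full max_run."""
    return max_wait - max_run


# Example: allow 1 hour of training and up to 1 more hour of waiting for Spot.
kwargs = spot_training_kwargs(max_run=3600, max_wait=7200)
print(kwargs, spot_wait_budget(3600, 7200))
```

You would then pass these to your estimator along with the usual arguments, for example `HuggingFace(..., **kwargs)` or `Estimator(..., **kwargs)`.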
