Training Spot Instance: meaning of parameters max_run and max_wait

Hi,

In the notebook sagemaker >> 05_spot_instances >> sagemaker-notebook.ipynb, we need to set up the parameters max_run and max_wait, as described in the Hugging Face doc Spot instances.

However, there is no explanation about their meaning.

I searched for more information in the AWS SageMaker doc and found the parameters MaxRuntimeInSeconds and MaxWaitTimeInSeconds:

  • MaxRuntimeInSeconds: The maximum length of time, in seconds, that a training or compilation job can run.
  • MaxWaitTimeInSeconds: The maximum length of time, in seconds, that a managed Spot training job has to complete. It is the amount of time spent waiting for Spot capacity plus the amount of time the job can run. It must be equal to or greater than MaxRuntimeInSeconds. If the job does not complete during this time, Amazon SageMaker ends the job.

I guess that:

  • max_run = MaxRuntimeInSeconds
  • max_wait = MaxWaitTimeInSeconds

max_run

My question about max_run: in the case of training a Hugging Face model with Trainer(), what is a “training or compilation job”?

  1. Is it the whole training job, from the first step to the final one (i.e., all the epochs)?
  2. Is it the training job per checkpoint (per epoch, or per defined number of steps), and, separately, the evaluation job at the end of each checkpoint?

max_wait

My question about max_wait: in the case of training a Hugging Face model with Trainer(), what is a “managed Spot training job”?

  1. (same as question 1 for max_run) the whole training job?
  2. (same as question 2 for max_run) the training job per checkpoint?

cc @philschmid

See the documentation here: Estimators — sagemaker 2.72.1 documentation. Sorry if it’s hard to find; I’ll circulate the feedback on our side.

Both max_run and max_wait are SageMaker Training parameters; they have no specific connection with Hugging Face. They control how your job (Hugging Face or other code) behaves when using Spot capacity (use_spot_instances=True).

EC2 Spot is spare compute capacity available at a discount. Read more here: Amazon EC2 Spot – Save up-to 90% on On-Demand Prices
The two most important things to be aware of about Spot are:

  1. Capacity is variable, because it is based on spare compute. It’s better to use it opportunistically and be ready to fall back to on-demand (or be ready to wait for availability) than to expect to be able to use any amount of Spot at any time.
  2. Spot capacity can be reclaimed. Workloads running on Spot should either be quick enough not to worry about interruptions, or have interruption-handling mechanisms built in.

In SageMaker, you can decide to run your job on EC2 Spot capacity, and get up to 90% savings vs the public on-demand price (in my personal experience, it’s often around 70%, as reported here).
SageMaker Spot is called “Managed Spot” because it is easier to use than raw EC2 Spot:

  • You just need to specify 3 parameters (use_spot_instances, max_wait, max_run)
  • SageMaker has a checkpoint feature: you can read from and write to the checkpoint location to persist and recover state across Spot interruptions (anything written to the local checkpoint location goes immediately to S3, and a fresh job loads the content of the S3 checkpoint location back into the local checkpoint location). Note, however, that it is your responsibility to know what to save and how to read it back so that interruptions are handled gracefully. SageMaker Checkpoint just syncs your files with S3 and reloads them after an interruption; it is not aware of how those files should be handled, which must be done by your code (a sketch of the script side follows just below).
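To make the “handled by your code” part concrete, here is a minimal sketch of the script side. It assumes a Trainer-based entry point, the default /opt/ml/checkpoints local path, and a small illustrative model and dataset; it is not the notebook’s exact code.

```python
import os

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from transformers.trainer_utils import get_last_checkpoint

# SageMaker syncs this folder to checkpoint_s3_uri during training and
# restores its content from S3 before a restarted job runs the script.
checkpoint_dir = "/opt/ml/checkpoints"

model_name = "distilbert-base-uncased"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny illustrative dataset, tokenized to a fixed length.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

training_args = TrainingArguments(
    output_dir=checkpoint_dir,  # Trainer writes checkpoint-* folders here,
    save_strategy="epoch",      # which SageMaker mirrors to S3 as they appear
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=training_args, train_dataset=dataset)

# If a previous (interrupted) run left checkpoints behind, resume from the latest one.
last_checkpoint = get_last_checkpoint(checkpoint_dir) if os.path.isdir(checkpoint_dir) else None
trainer.train(resume_from_checkpoint=last_checkpoint)
```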

In particular:

  • max_run is the max cumulative duration of the SageMaker Training job
  • max_wait is the max time you are going to wait for your job to complete (max_run + Spot waiting time)

And max_wait must be equal to or larger than max_run.
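For illustration, here is what those three parameters look like on a Hugging Face estimator (a minimal sketch; the script name, instance type, version strings and durations are placeholders, not the notebook’s exact values):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="train.py",       # your Trainer() script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.12",  # illustrative versions
    pytorch_version="1.9",
    py_version="py38",
    use_spot_instances=True,      # run on Managed Spot
    max_run=10 * 3600,            # -> MaxRuntimeInSeconds: up to 10 h of actual training
    max_wait=20 * 3600,           # -> MaxWaitTimeInSeconds: training time + waiting for Spot capacity
)
```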


@OlivierCR, I read all this information in the AWS SageMaker doc, but it does not answer my questions about the Trainer() job of Hugging Face (see my post).

@pierreguillou To rephrase what @OlivierCR said:
max_run: answer 1, the whole training job.
max_wait: none of your answers. It is the max time to get a Spot instance allocated, plus max_run.

Hi @CyranoB.

This means that before starting a Hugging Face model training/fine-tuning on an AWS SageMaker Training Spot instance, I need to calculate the duration of that training job (sum of training times + sum of evaluation times). Is that what you said?

How do you do this?

No, you don’t need to do that; it is a maximum boundary that you set.
By default it is one day (86400 s). Think of it more as an architectural good practice: it allows you to set a max duration on your training jobs, so that things don’t run indefinitely and incur charges.

But to answer your question on how to estimate the duration of a training run: just train for a few epochs (or iterations, if an epoch is too long), then scale to the desired number of iterations or epochs.
Mini-batch SGD has the advantage of being a double for loop (a loop through the dataset inside a loop through epochs), so its duration is quite predictable once you know the duration of a single iteration or a single epoch.
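For example, a back-of-the-envelope sketch with made-up numbers:

```python
# Time a short run, scale linearly, and add a safety margin before setting max_run/max_wait.
seconds_per_step = 0.9        # measured by timing, say, 200 warm steps
steps_per_epoch = 12_000      # len(train_dataset) // effective_batch_size
num_epochs = 3

estimated_training = seconds_per_step * steps_per_epoch * num_epochs
estimated_eval = 600 * num_epochs             # rough per-epoch evaluation cost

max_run = int((estimated_training + estimated_eval) * 1.2)  # 20% safety margin
max_wait = max_run + 3600                                   # allow up to 1 h to obtain Spot capacity

print(max_run, max_wait)  # 41040 44640
```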

Thanks, but I don’t understand. For example, I’m fine-tuning a T5 large with a dataset of millions of examples. I can guarantee that the whole training will take more than 1 day (2 or 3, I think) on an ml.p3.2xlarge instance.

If I can’t calculate the whole training time but I have to set up a max_run, it means I will kill my whole training job before its end… and clearly, I do not want to do that.

Ok… I get it, but it looks a bit handmade, right? (when the goal of Hugging Face and AWS SageMaker is to help automate the use of transformers)

yes good points :slight_smile:

The max duration of a SageMaker Training job is 5 days. So you could set this as max_run if you want to run for as long as possible. If you want to run something longer, you can try chaining together multiple checkpointed training jobs (I haven’t seen it done yet though), for example using a Lambda launching job N+1 when triggered by the notification emitted when job N stops

You can take a big enough max_run (less than 5 days, though). Or, if you checkpoint your training state, you can restart it later from another training job, so that interruptions don’t really matter; see the sketch below.
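A sketch of that restart idea, assuming the follow-up job reuses the same checkpoint_s3_uri prefix as the stopped job (bucket name, versions and durations are illustrative):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# SageMaker downloads the content of checkpoint_s3_uri into /opt/ml/checkpoints
# before train.py starts, so a script that resumes from the latest checkpoint
# picks up where the previous job stopped.
follow_up_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.12",  # illustrative versions
    pytorch_version="1.9",
    py_version="py38",
    use_spot_instances=True,
    max_run=5 * 24 * 3600,        # up to the (default) 5-day limit
    max_wait=5 * 24 * 3600 + 7200,
    checkpoint_s3_uri="s3://my-bucket/t5-finetune/checkpoints",  # same prefix as the previous job
)
follow_up_estimator.fit()
```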

Also note that if you train things for multiple days on one p3.2xlarge and want to go faster, it’s worth considering data-parallel training (example here https://huggingface.co/blog/sagemaker-distributed-training-seq2seq)
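For a rough idea of what that looks like on the estimator side (instance type, count and versions here are illustrative; the linked blog post has a complete example):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Data-parallel training with SageMaker's distributed data parallel library:
# pick a multi-GPU instance type (and/or several instances) and pass the
# `distribution` argument; the Trainer script itself stays largely unchanged.
distributed_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.16xlarge",  # 8 GPUs per instance
    instance_count=2,                # 16 GPUs in total
    role=role,
    transformers_version="4.12",     # illustrative versions
    pytorch_version="1.9",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
distributed_estimator.fit()
```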


Agree! There is opportunity for innovation here. I’m not aware of any tool that can predict the training time and resource consumption of a training script given an arbitrary dataset, an arbitrary model graph, and an arbitrary training cluster. But why not tackle it as a regression problem, trained on a few previous training jobs? I haven’t seen it done much though (there was an attempt to do it for Apache Spark). Could be a unicorn startup idea :wink:

Thanks @OlivierCR for your answers.

Are we talking here about any AWS SageMaker Training instance, and not only AWS SageMaker Spot Training instances?

I’m asking this because I’ve never seen, in the Hugging Face AWS SageMaker notebooks, the necessity of setting a max_run parameter (except in the one about Spot instances).

I thought that an AWS SageMaker Training instance stopped only at the end of the whole training job (for example, I do not see max_run in the HF doc “Create a Hugging Face Estimator”).

What do you think?

Hi, yes, this max_run is for anything running in SageMaker Training (the HF DLC, but also built-in algos, other framework containers like Scikit-learn, XGBoost, etc., and bring-your-own Docker).

It is an optional parameter of the Python SDK Estimators, set by default to 86400s (1 day). Customer code (including HF) most often takes less than a day to run, which is why people don’t know about that parameter and rarely override it.

I see in the doc about MaxRuntimeInSeconds that the default value is 1 day, but it says the maximum is 28 days:

The default value is 1 day. The maximum value is 28 days.

And when I search for the 5 days you mention in your post, I find this page in the AWS SageMaker doc that confirms it:

Longest run time for a processing job: 5 days

What is the right value as a maximum for a training/processing job?


Good catch; I’m checking with colleagues. It’s possible that the limit was recently increased; I’ll update the thread.

So it seems that the 5-day limit is a default quota that can be increased, while the 28-day limit is the actual limit in the API.
