MaxRuntimeInSeconds: The maximum length of time, in seconds, that a training or compilation job can run.
MaxWaitTimeInSeconds: The maximum length of time, in seconds, that a managed Spot training job has to complete. It is the amount of time spent waiting for Spot capacity plus the amount of time the job can run. It must be equal to or greater than MaxRuntimeInSeconds. If the job does not complete during this time, Amazon SageMaker ends the job.
I guess that:
max_run = MaxRuntimeInSeconds
max_wait = MaxWaitTimeInSeconds
max_run
My question about max_run: in the case of training a Hugging Face model with Trainer(), what is a “training or compilation job”?
Is it the whole training job from the first step to the final one (i.e., all the epochs)?
Is it the training job by checkpoint (by epoch or by the number of defined steps) and, separately, the evaluation job at the end of each checkpoint?
max_wait
My question about max_wait: in the case of training a Hugging Face model with Trainer(), what is a “managed Spot training job”?
(same question 1 as for max_run) whole training job?
(same question 2 as for max_run) training job by checkpoint?
Both max_run and max_wait are SageMaker Training parameters; they have no connection with Hugging Face. They control how your job (Hugging Face or any other code) behaves when using Spot capacity (use_spot_instances=True).
Capacity is variable, because it is based on spare compute. It’s better to use it opportunistically and be ready to fall back to on-demand (or to wait for availability) rather than expect to be able to use any amount of Spot all the time and at any time.
Spot capacity can be reclaimed. Workloads running on Spot should either be quick enough not to worry about interruptions, or have interruption-handling mechanisms built-in
In SageMaker, you can decide to run your job on EC2 Spot capacity, and get up to 90% savings vs the public on-demand price (in my personal experience, it’s often around 70%, as reported here).
SageMaker Spot is called “Managed Spot” because it is easier to use than raw EC2 Spot:
You just need to specify 3 parameters (use_spot_instances, max_wait, max_run); see the sketch after this list
SageMaker has a checkpoint feature: you can read and write from the checkpoint location to persist and recover state across Spot interruptions (anything written to the local checkpoint location is immediately copied to S3, and a fresh job downloads the content of the S3 checkpoint location back into the local checkpoint location). Note, however, that it is your responsibility to know what to save and how to read it back so that interruptions are handled gracefully. SageMaker Checkpoint just syncs your files with S3 and reloads them after an interruption; it is not aware of how those files should be handled, which must be done by your code.
In particular:
max_run is the max cumulative duration of the SageMaker Training job
max_wait is the max time you are going to wait for your job to complete (max_run + Spot waiting time)
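For illustration, here is a minimal sketch of what those three parameters look like on a Hugging Face estimator (the role ARN, S3 URIs, DLC versions and hyperparameters below are placeholders, not values from this thread):

```python
from sagemaker.huggingface import HuggingFace

# Managed Spot training: the three Spot-related parameters, plus a checkpoint
# location so that anything written to /opt/ml/checkpoints survives interruptions.
huggingface_estimator = HuggingFace(
    entry_point="train.py",              # your script calling Trainer()
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="arn:aws:iam::111122223333:role/my-sagemaker-role",   # placeholder
    transformers_version="4.6",          # example DLC versions
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={"epochs": 3, "train_batch_size": 32},
    use_spot_instances=True,             # run on EC2 Spot capacity
    max_run=86400,                       # max cumulative runtime of the job (1 day)
    max_wait=129600,                     # max_run + the time you accept to wait for Spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints",  # synced with /opt/ml/checkpoints
)

huggingface_estimator.fit({"train": "s3://my-bucket/train"})  # placeholder channel
```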
@OlivierCR, I read all this information in the AWS SageMaker doc, but it does not answer my questions about the Trainer() job of Hugging Face (see my post).
@pierreguillou To rephrase what @OlivierCR said:
max_run: answer 1, the whole training job.
max_wait: none of your answers. This is the max time to get a Spot instance allocated, plus the max_run.
This means that before starting a Hugging Face model training/fine-tuning on an AWS SageMaker Training Spot instance, I need to calculate the duration of that training job (sum of training times + sum of evaluation times). Is that what you said?
No, you don’t need to do that; this is a maximum boundary that you set.
By default it is one day (86400 s). Think about it more as an architectural good practice: it allows you to set a max duration for your training jobs, so that things don’t run indefinitely and incur charges.
But to answer your question on how to estimate the duration of a training run: just train for a few epochs (or iterations if an epoch is too long), then scale to the desired number of iterations or epochs.
Mini-batch SGD has the advantage of being a double for loop (a loop through the dataset nested in a loop through epochs), so its duration is quite predictable once you know the duration of a single iteration or a single epoch.
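To make that scaling concrete, here is a back-of-the-envelope sketch (all numbers below are made up for illustration):

```python
# Time a short trial run, then extrapolate linearly:
# total ≈ seconds_per_step * steps_per_epoch * num_epochs.
seconds_per_step = 0.45        # measured on a few hundred trial iterations (made-up value)
steps_per_epoch = 125_000      # len(train_dataset) // batch_size (made-up value)
num_epochs = 3

estimated_runtime = seconds_per_step * steps_per_epoch * num_epochs
max_run = int(estimated_runtime * 1.2)   # add a safety margin before using it as max_run

print(f"estimated runtime: {estimated_runtime / 3600:.1f} h, max_run={max_run} s")
```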
Thanks, but I don’t understand. For example, I’m fine-tuning a T5 large with a dataset of millions of examples. I can guarantee that the whole training job will last more than 1 day (2 or 3, I think) on an ml.p3.2xlarge instance.
If I can’t calculate the whole training time but I have to set a max_run, it means I will kill my whole training job before its end… and clearly, I do not want to do that.
Ok… I get it, but it looks a bit handmade, right? (when the goal of Hugging Face and AWS SageMaker is to help automate the use of transformers)
The max duration of a SageMaker Training job is 5 days. So you could set this as max_run if you want to run for as long as possible. If you want to run something longer, you can try chaining together multiple checkpointed training jobs (I haven’t seen it done yet though), for example using a Lambda that launches job N+1 when triggered by the notification emitted when job N stops.
You can take a big enough max_run (less than 5 days though). Or, if you checkpoint your training state, you can restart it later from another training job, so that interruptions don’t really matter.
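For reference, a common pattern for making a Trainer() script resumable across Spot interruptions or chained jobs is a sketch like the following (it assumes model, train_dataset and eval_dataset are defined earlier in your train.py, and that checkpoint_s3_uri is set on the estimator so /opt/ml/checkpoints is synced with S3):

```python
import os

from transformers import Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

# SageMaker syncs this local folder with checkpoint_s3_uri, so checkpoints
# written here survive Spot interruptions and job restarts.
checkpoint_dir = "/opt/ml/checkpoints"

training_args = TrainingArguments(
    output_dir=checkpoint_dir,   # Trainer writes checkpoint-* folders here
    save_strategy="epoch",       # or save by steps, depending on your setup
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,                 # assumed defined earlier in the script
    args=training_args,
    train_dataset=train_dataset, # assumed defined earlier in the script
    eval_dataset=eval_dataset,
)

# If a previous (interrupted) job already wrote checkpoints, they were restored
# from S3 into checkpoint_dir: resume from the last one instead of restarting.
last_checkpoint = get_last_checkpoint(checkpoint_dir) if os.path.isdir(checkpoint_dir) else None
trainer.train(resume_from_checkpoint=last_checkpoint)
```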
Agree! There is opportunity for innovation here. I’m not aware of any tool that can predict the training time and resource consumption of a training script given an arbitrary dataset, an arbitrary model graph, and an arbitrary training cluster. But why not tackle it as a regression problem, trained on a few previous training jobs? I haven’t seen it done much though (there was an attempt to do it for Apache Spark). Could be a unicorn startup idea.
Are we talking here about any AWS SageMaker Training instance, and not only an AWS SageMaker Spot Training instance?
I’m asking because I’ve never seen the max_run parameter set in the Hugging Face AWS SageMaker notebooks (only in the one about Spot instances).
I thought that an AWS SageMaker Training instance stopped only at the end of the whole training job (for example, I do not see max_run in the HF doc “Create a Hugging Face Estimator”).
Hi, yes, this max_run applies to anything running in SageMaker Training (the HF DLC, but also built-in algos, other framework containers like Scikit-learn and XGBoost, and bring-your-own Docker).
It is an optional parameter of the Python SDK Estimators, set by default to 86400s (1 day). Customer code (including HF) most often takes less than a day to run, which is why people don’t know about that parameter and rarely override it.
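For example, on a regular on-demand job you can simply override that default when building the estimator (the role ARN and DLC versions below are placeholders):

```python
from sagemaker.huggingface import HuggingFace

# max_run also applies without Spot: raise it so a long fine-tuning job
# is not stopped at the 1-day default (keep it under the 5-day limit).
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="arn:aws:iam::111122223333:role/my-sagemaker-role",  # placeholder
    transformers_version="4.6",   # example DLC versions
    pytorch_version="1.7",
    py_version="py36",
    max_run=3 * 24 * 3600,        # 259200 s = 3 days
)
```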