Understanding TrainingJobAnalytics

Hello guys :wave:

I’m working on a text classification project, so I’m testing different models (with different hyperparameter configurations) to see which one gives me the best results.

To compare the different models, I’d like to retrieve the metrics computed during the training jobs. I followed what is explained in the various notebooks using the TrainingJobAnalytics class.
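For context, the call I’m making looks roughly like this (the training job name below is just a placeholder):

from sagemaker.analytics import TrainingJobAnalytics

# Placeholder job name; in practice I use the name of my completed training job.
metrics_df = TrainingJobAnalytics(training_job_name="my-huggingface-training-job").dataframe()
print(metrics_df)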

However, I’m not sure I understand which parameter controls the logging step (if there is one). Let me explain: I have this configuration:

parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--train-batch-size", type=int, default=16)
parser.add_argument("--eval-batch-size", type=int, default=32)
parser.add_argument("--model_name", type=str)
parser.add_argument("--learning_rate", type=str, default=2e-5)
parser.add_argument("--weight_decay", type=str, default=0.01)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)

Both eval_strategy and logging_strategy are set to "epoch". Therefore, I expect the output of TrainingJobAnalytics to contain 5 rows for each metric, representing the value at each epoch as it is displayed during the training process.
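For reference, the relevant Trainer settings in my training script look roughly like this (a sketch; the output_dir is a placeholder, and in newer transformers releases evaluation_strategy is named eval_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/model",                  # placeholder output directory
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    learning_rate=args.learning_rate,
    weight_decay=args.weight_decay,
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    evaluation_strategy="epoch",   # run evaluation once per epoch
    logging_strategy="epoch",      # log training metrics once per epoch
)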

However, this is what I actually get in the output:

Is there a way to retrieve the same metrics displayed during the training job when calling TrainingJobAnalytics? Or are the retrieved values only based on the frequency at which SageMaker sends events to CloudWatch (which seems to be every 60 seconds)?

Thank you in advance :pray:

Hello @YannAgora,

There are two options for how you could proceed:

  1. Use Weights & Biases or Tensorboard integration with Transformers and track you experiments like that.
  2. To use Sagemaker-experiments/modify the information extraction you need to define regex pattern from the information you want to extract from the logs and the provide them when creating your Training, see this example
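For option 2, a minimal sketch assuming the Hugging Face estimator from the SageMaker Python SDK; the entry point, role, instance type, framework versions, and hyperparameters below are placeholders:

from sagemaker.huggingface import HuggingFace

# Regex patterns applied to the CloudWatch training logs; each match becomes
# a metric value that TrainingJobAnalytics can later return.
metric_definitions = [
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9.]+)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9.]+)"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9.]+)"},
]

huggingface_estimator = HuggingFace(
    entry_point="train.py",                     # placeholder training script
    source_dir="./scripts",                     # placeholder source directory
    instance_type="ml.p3.2xlarge",              # placeholder instance type
    instance_count=1,
    role="<your-sagemaker-execution-role>",     # placeholder IAM role
    transformers_version="4.6",                 # placeholder framework versions
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={"epochs": 5, "train-batch-size": 16},
    metric_definitions=metric_definitions,      # picked up by TrainingJobAnalytics
)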

Is there a way to retrieve the same metrics displayed during the training job when calling TrainingJobAnalytics? Or are the retrieved values only based on the frequency at which SageMaker sends events to CloudWatch (which seems to be every 60 seconds)?

The SageMaker Experiments metrics can be displayed in real time in Amazon SageMaker Studio, or you can write a custom polling method yourself.
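A minimal sketch of such a polling loop (the job name, number of iterations, and interval are placeholders):

import time
from sagemaker.analytics import TrainingJobAnalytics

job_name = "my-huggingface-training-job"  # placeholder training job name

# Poll the metrics every 60 seconds while the job runs and print the latest values.
for _ in range(10):
    df = TrainingJobAnalytics(training_job_name=job_name).dataframe()
    if not df.empty:
        print(df.tail())
    time.sleep(60)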

For W&B and TB they are reported in real-time to either the Hub or W&B platform.


Thank you @philschmid for the options you suggested; I’ll dig into them :+1:

Meanwhile, I found a workaround: capturing the training logs and then applying a regex to them to extract the different metric dictionaries. (Certainly not the most efficient solution, but it’s simple and it works :sweat_smile:)

%%capture log
huggingface_estimator.logs()

then

import re
items = re.findall(r"({.+})", log.stdout)
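The captured strings can then be parsed back into dictionaries, roughly like this (a sketch; it assumes the matched log lines are valid Python dict literals):

import ast
import pandas as pd

# Each matched string looks like "{'loss': 0.5, 'epoch': 1.0, ...}";
# ast.literal_eval turns it back into a dict.
records = [ast.literal_eval(item) for item in items]
metrics_df = pd.DataFrame(records)
print(metrics_df)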