Understanding TrainingJobAnalytics

Hello guys :wave:

I’m working on a text classification project, so I’m testing different models (with different hyperparameter configurations) to see which one gives me the best results.

To compare the different models, I’d like to retrieve the metrics computed during the training jobs. I followed what is explained in the various notebooks using the TrainingJobAnalytics class.
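For context, the call I’m making looks roughly like this (the training job name below is just a placeholder):

from sagemaker.analytics import TrainingJobAnalytics

# Placeholder job name; in practice I use the name of my completed training job.
metrics_df = TrainingJobAnalytics(training_job_name="my-huggingface-training-job").dataframe()
print(metrics_df)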

However, I’m not sure I understand which parameter controls the logging step (if there is one). Let me explain: I have this configuration:

parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--train-batch-size", type=int, default=16)
parser.add_argument("--eval-batch-size", type=int, default=32)
parser.add_argument("--model_name", type=str)
parser.add_argument("--learning_rate", type=str, default=2e-5)
parser.add_argument("--weight_decay", type=str, default=0.01)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)

Both eval_strategy and logging_strategy are set to "epoch". Therefore, I expect the output of TrainingJobAnalytics to contain 5 rows for each metric, representing the value at each epoch as it is displayed during the training process.
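For reference, the relevant Trainer settings in my training script look roughly like this (a sketch; the output_dir is a placeholder, and in newer transformers releases evaluation_strategy is named eval_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/model",                  # placeholder output directory
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    learning_rate=args.learning_rate,
    weight_decay=args.weight_decay,
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    evaluation_strategy="epoch",   # run evaluation once per epoch
    logging_strategy="epoch",      # log training metrics once per epoch
)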

However, this is what I actually get in the output:

Is there a way to retrieve the same metrics displayed during the training job when calling TrainingJobAnalytics? Or are the retrieved values only based on the frequency at which SageMaker sends events to CloudWatch (which seems to be every 60 seconds)?

Thank you in advance :pray:

Hello @YannAgora,

There are two options for how you could proceed:

  1. Use Weights & Biases or Tensorboard integration with Transformers and track you experiments like that.
  2. To use Sagemaker-experiments/modify the information extraction you need to define regex pattern from the information you want to extract from the logs and the provide them when creating your Training, see this example
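For option 2, a minimal sketch assuming the Hugging Face estimator from the SageMaker Python SDK; the entry point, role, instance type, framework versions, and hyperparameters below are placeholders:

from sagemaker.huggingface import HuggingFace

# Regex patterns applied to the CloudWatch training logs; each match becomes
# a metric value that TrainingJobAnalytics can later return.
metric_definitions = [
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9.]+)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9.]+)"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9.]+)"},
]

huggingface_estimator = HuggingFace(
    entry_point="train.py",                     # placeholder training script
    source_dir="./scripts",                     # placeholder source directory
    instance_type="ml.p3.2xlarge",              # placeholder instance type
    instance_count=1,
    role="<your-sagemaker-execution-role>",     # placeholder IAM role
    transformers_version="4.6",                 # placeholder framework versions
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={"epochs": 5, "train-batch-size": 16},
    metric_definitions=metric_definitions,      # picked up by TrainingJobAnalytics
)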

Is there a way to retrieve the same metrics displayed during the training job when calling TrainingJobAnalytics? Or are the retrieved values only based on the frequency at which SageMaker sends events to CloudWatch (which seems to be every 60 seconds)?

The SageMaker Experiments metrics can be displayed in real time in Amazon SageMaker Studio, or you can write a custom polling method yourself.
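A minimal sketch of such a polling loop (the job name, number of iterations, and interval are placeholders):

import time
from sagemaker.analytics import TrainingJobAnalytics

job_name = "my-huggingface-training-job"  # placeholder training job name

# Poll the metrics every 60 seconds while the job runs and print the latest values.
for _ in range(10):
    df = TrainingJobAnalytics(training_job_name=job_name).dataframe()
    if not df.empty:
        print(df.tail())
    time.sleep(60)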

For W&B and TB they are reported in real-time to either the Hub or W&B platform.


Thank you @philschmid for the options you suggested; I’ll dig into them :+1:

Meanwhile, I found a workaround: capturing the training logs and then applying a regex to them to extract the different metric dictionaries. (Certainly not the most efficient solution, but it’s simple and it works :sweat_smile:)

%%capture log
huggingface_estimator.logs()

then

import re
items = re.findall(r"({.+})", log.stdout)
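The captured strings can then be parsed back into dictionaries, roughly like this (a sketch; it assumes the matched log lines are valid Python dict literals):

import ast
import pandas as pd

# Each matched string looks like "{'loss': 0.5, 'epoch': 1.0, ...}";
# ast.literal_eval turns it back into a dict.
records = [ast.literal_eval(item) for item in items]
metrics_df = pd.DataFrame(records)
print(metrics_df)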