Hello guys
I’m working on a text classification project, so I’m testing different models (with different hyperparameter configurations) to see which one gives me the best results.
To be able to compare the different models, I’d like to retrieve the metrics computed during the training jobs. I followed what is explained in the various notebooks that use the TrainingJobAnalytics class.
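For reference, this is roughly how I call it (a minimal sketch; the job name is a placeholder, and the metric names are assumed to match the metric_definitions set on the estimator):

from sagemaker.analytics import TrainingJobAnalytics

# "my-training-job" is a placeholder for the actual training job name
analytics = TrainingJobAnalytics(
    training_job_name="my-training-job",
    metric_names=["eval_loss", "eval_accuracy"],
)
df = analytics.dataframe()  # one row per (metric, timestamp) pair
print(df)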
However, I’m not sure I understand which parameter controls the logging step (if there is one). Let me explain: I have this config in my training script:
parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--train-batch-size", type=int, default=16)
parser.add_argument("--eval-batch-size", type=int, default=32)
parser.add_argument("--model_name", type=str)
parser.add_argument("--learning_rate", type=str, default=2e-5)
parser.add_argument("--weight_decay", type=str, default=0.01)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
with eval_strategy and logging_strategy both set to "epoch". Therefore, I expect the output of TrainingJobAnalytics to contain 5 rows for each metric, one value per epoch, exactly as displayed during the training process.
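For context, here is roughly how those parsed arguments are wired into the Trainer in my script (a sketch, exact names may differ; note that eval_strategy is called evaluation_strategy in older transformers versions):

from transformers import TrainingArguments

args = parser.parse_args()  # parser from the config above

training_args = TrainingArguments(
    output_dir="/opt/ml/model",  # SageMaker's model output directory
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    learning_rate=args.learning_rate,
    weight_decay=args.weight_decay,
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    logging_strategy="epoch",     # log metrics at the end of every epoch
)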
However, the output I actually get doesn’t match that.
Is there a way to retrieve the same metrics that are displayed during the training job when calling TrainingJobAnalytics? Or are the retrieved values only sampled at the frequency at which SageMaker sends events to CloudWatch (which seems to be every 60 seconds)?
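In case it helps, the metrics that TrainingJobAnalytics sees are the ones SageMaker scrapes from the training log via the metric_definitions regexes on the estimator. Mine look roughly like this (a sketch following the Hugging Face example notebooks; the script name, role, instance type, and framework versions are placeholders):

from sagemaker.huggingface import HuggingFace

# SageMaker matches these regexes against the training log and publishes
# the captured values to CloudWatch, which TrainingJobAnalytics then reads
metric_definitions = [
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9\\.]+)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9\\.]+)"},
]

huggingface_estimator = HuggingFace(
    entry_point="train.py",         # placeholder script name
    role="<execution-role-arn>",    # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # placeholder instance type
    transformers_version="4.6.1",   # placeholder framework versions
    pytorch_version="1.7.1",
    py_version="py36",
    metric_definitions=metric_definitions,
    hyperparameters={"epochs": 5},
)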
Thank you in advance