Training Metrics in AWS SageMaker

Hi,

in the notebook 06_sagemaker_metrics / sagemaker-notebook.ipynb, there is the code to get training and eval metrics at the end of the training from the HuggingFaceEstimator.

How can we get them DURING the training?

Great, but I don’t understand how we can get them DURING the training to check how well (or badly) the training is going (for example, to detect overfitting and then stop training before the last epoch).

My idea was to create a duplicate notebook (without running fit() in the duplicate) for that purpose. The following text in the notebook seems to say that this is possible, but how do we specify the exact training job name in the TrainingJobAnalytics API call? Thanks.

Note that you can also copy this code and run it from a different place (as long as you are connected to the cloud and authorized to use the API), by specifying the exact training job name in the TrainingJobAnalytics API call.

Problem: “Warning: No metrics called eval_loss found”

I have a second question.

I used the metrics code (copy/paste) from 06_sagemaker_metrics / sagemaker-notebook.ipynb in a NER finetuning notebook on AWS SageMaker.

My NER notebook directly uses the run_ner.py script from GitHub (through the git_config argument of my Hugging Face Estimator).

metric_definitions=[
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}]

git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.12.3'}

huggingface_estimator = HuggingFace(
    entry_point          = 'run_ner.py',
    source_dir           = './examples/pytorch/token-classification',  
    git_config           = git_config,
    (...),
    metric_definitions   = metric_definitions,
)

Training runs without problems, but when I try to display the metrics, most of them are not found (see the following screenshot):

I compared the logging setup in the 2 scripts, and it is different.

In train.py:

# Set up logging
logger = logging.getLogger(__name__)

logging.basicConfig(
    level=logging.getLevelName("INFO"),
    handlers=[logging.StreamHandler(sys.stdout)],
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

In run_ner.py:

logger = logging.getLogger(__name__)

# Setup logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)

Is this the reason for the problem?

You can find them in the AWS Management Console under SageMaker → Training → Training jobs → details view. Or you can use SageMaker Studio to inspect them. AWS Blog post

Problem: “Warning: No metrics called eval_loss found”

If you are using metric_definitions, SageMaker scans stdout using the regex patterns defined there. If a pattern doesn’t match the output, no metric is found for it. So I guess for run_ner.py you might need to adjust the regex patterns in metric_definitions. Here is more documentation on that: Amazon SageMaker
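
One way to check a pattern before launching a job is to run it locally against a line copied from the training logs. The sketch below does that with Python's re module. The two log lines are hypothetical examples written for illustration, not actual run_ner.py output, and the extract_metric helper is my own name; I am also assuming that the first capture group is what holds the metric value.

```python
import re

# Patterns copied from the metric_definitions in the question
EVAL_LOSS_REGEX = r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"
LEARNING_RATE_REGEX = r"'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"


def extract_metric(regex, log_line):
    """Return the first captured group as a float, or None if there is no match."""
    match = re.search(regex, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log lines (assumptions, for illustration only)
dict_style_line = "{'eval_loss': 0.1234, 'learning_rate': 4e-05, 'epoch': 1.0}"
key_value_line = "  eval_loss = 0.1234"

# The dict-style line matches, the key/value line does not
print(extract_metric(EVAL_LOSS_REGEX, dict_style_line))
print(extract_metric(LEARNING_RATE_REGEX, dict_style_line))
print(extract_metric(EVAL_LOSS_REGEX, key_value_line))
```

If the real log line prints the metric in another layout, the pattern has to be rewritten for that layout before SageMaker can pick the value up.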

@pierreguillou - how to find the metrics: in addition to dashboarding them in CloudWatch via the path provided by Philipp (job detail page in the console, then the “algorithm metrics” link at the bottom), you can also pull them in real time with the SDK.

from sagemaker.analytics import TrainingJobAnalytics

df = TrainingJobAnalytics(training_job_name="jobname").dataframe()
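
To watch for overfitting while the job is still running, the same call can be repeated in a loop from a second notebook. This is only a rough sketch: the function names are made up for this example, the "eval loss rose for patience evaluations in a row" rule is a simple heuristic I picked rather than anything the SDK provides, and I am assuming the dataframe has timestamp, metric_name and value columns, as used in the metrics notebook.

```python
import time


def fetch_metric_values(job_name, metric_name):
    """Return the values recorded so far for one metric, oldest first."""
    # Imported here so the helpers below also work where the SDK is not installed
    from sagemaker.analytics import TrainingJobAnalytics

    df = TrainingJobAnalytics(
        training_job_name=job_name, metric_names=[metric_name]
    ).dataframe()
    if df.empty:  # nothing reported yet, or the regex never matched
        return []
    return df.sort_values("timestamp")["value"].tolist()


def is_overfitting(eval_losses, patience=2):
    """True if eval loss increased for `patience` consecutive evaluations."""
    if len(eval_losses) < patience + 1:
        return False
    recent = eval_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def watch(job_name, poll_seconds=60, max_polls=10, fetch=fetch_metric_values):
    """Poll eval_loss and report as soon as it keeps rising."""
    for _ in range(max_polls):
        losses = fetch(job_name, "eval_loss")
        print(f"eval_loss so far: {losses}")
        if is_overfitting(losses):
            print("eval_loss keeps rising, consider stopping the job")
            return True
        time.sleep(poll_seconds)
    return False
```

Calling watch("jobname") with the full training job name would then poll once per minute. Each poll creates a new TrainingJobAnalytics object, so no previously fetched result is reused.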

Hi @OlivierCR.

You’re right: I did create another notebook in my AWS notebook instance where a training notebook is running and I copied/pasted the metrics code from the HF sagemaker notebook about this topic.

The main one is what you posted:

from sagemaker.analytics import TrainingJobAnalytics
df = TrainingJobAnalytics(training_job_name="jobname").dataframe()

My problem was as follows:

In my training notebook, I created a job_name as @philschmid did in the notebook workshop_1_getting_started_with_amazon_sagemaker/lab_1_default_training.ipynb, and passed it to my HuggingFaceEstimator as follows:

# define Training Job Name 
import time
job_name = f'huggingface-workshop-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# create the Estimator
huggingface_estimator = HuggingFace(
    base_job_name        = job_name,          # the name of the training job
    (...)
)

For example, this code gave job_name = 'huggingface-workshop-2021-12-09-18-03-30'

However, this job_name did not work in the metrics code (at the top of the post).

Then, I went to the AWS SageMaker console >> Training Jobs and saw a slightly different name, which I copied/pasted as job_name… and it worked!

Why does the AWS SageMaker console change the training job names?

Oh yes, good catch! With the Estimator class (and its children like HuggingFaceEstimator) of the high-level Python SDK, you can’t set the job name; you only set a prefix with base_job_name. The SDK then appends a timestamp to make it unique. SageMaker training job names have to be unique, and I guess the SDK team preferred to handle that uniqueness itself instead of forcing users to deal with it.

  • if you want to control the exact job name, you need to launch jobs with boto3 create_training_job as shown here
  • if you want to know the full job name of a job launched from an SDK Estimator, you can use Estimator.latest_training_job.name
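
From a second notebook, where the Estimator object is not available, the full name can probably also be looked up by its prefix through boto3's list_training_jobs. A small sketch, where find_latest_job_name is a hypothetical helper name and the prefix is whatever was passed as base_job_name:

```python
def find_latest_job_name(sagemaker_client, name_prefix):
    """Return the most recently created training job whose name contains name_prefix."""
    response = sagemaker_client.list_training_jobs(
        NameContains=name_prefix,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["TrainingJobSummaries"]
    return summaries[0]["TrainingJobName"] if summaries else None


# Usage (needs AWS credentials and permission to list training jobs):
#   import boto3
#   client = boto3.client("sagemaker")
#   full_name = find_latest_job_name(client, "huggingface-workshop-2021-12-09-18-03-30")
```

The returned name is the one to pass as training_job_name to TrainingJobAnalytics.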

I love this last solution :) Many thanks for your explanation and help!
