How to access /opt/ml/model before the end of model training?

Hi,

In the hyperparameters of my notebook (AWS SageMaker with the HF Training DLC), I defined:
'output_dir': '/opt/ml/model'

In the CloudWatch logs, I can see for example:
Model weights saved in /opt/ml/model/checkpoint-6000/pytorch_model.bin

I would like to test this checkpoint-6000 in another notebook without waiting for the end of the model training.

How can I access the content of /opt/ml/model/checkpoint-6000/ from the AWS console or from another notebook? (I do not want to use the AWS CLI in a terminal.)

Thanks.

Hi @pierreguillou, the content of /opt/ml/model is accessible only at the end of training (and it will be compressed into a model.tar.gz, which will be enormous if you save dozens or hundreds of transformers checkpoints there). If you want to export models to S3 during training, without interruption, in the same file format as saved locally, you need to save them in /opt/ml/checkpoints and specify the S3 sync location in your SDK call with the parameter checkpoint_s3_uri. Then use the AWS CLI or boto3 download_file to bring them from S3 back to your notebook.

See the doc of the checkpoint feature here; it is in my opinion one of the best features of SageMaker.
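To make the setup above concrete, here is a minimal sketch of the estimator arguments involved. The helper names (`checkpoint_estimator_kwargs`, `make_estimator`) and the bucket/prefix values are hypothetical; `checkpoint_s3_uri` and `checkpoint_local_path` are the SageMaker Python SDK estimator parameters, and everything else (entry point, role, instance type, framework versions) is assumed to be supplied by you.

```python
def checkpoint_estimator_kwargs(bucket, prefix):
    """Build the checkpoint-related estimator arguments (hypothetical helper).

    bucket and prefix are placeholders for your own S3 location.
    """
    local_dir = "/opt/ml/checkpoints"  # local folder that SageMaker syncs to S3
    return {
        # S3 location that receives the checkpoints while training runs
        "checkpoint_s3_uri": f"s3://{bucket}/{prefix}",
        # Local path inside the training container that is synced
        "checkpoint_local_path": local_dir,
        # Make the Trainer write its checkpoints into the synced folder
        "hyperparameters": {"output_dir": local_dir},
    }


def make_estimator(bucket, prefix, **other_kwargs):
    """Create a HuggingFace estimator with checkpoint syncing enabled.

    other_kwargs must carry your usual arguments (entry_point, role,
    instance_type, instance_count, framework versions, ...). Do not pass
    hyperparameters there; extend the dict returned above instead.
    """
    # Imported lazily so the helper above works without the SageMaker SDK
    from sagemaker.huggingface import HuggingFace

    return HuggingFace(**checkpoint_estimator_kwargs(bucket, prefix), **other_kwargs)
```

With this in place, each checkpoint-N folder written by the Trainer should show up under the S3 prefix shortly after it is saved, so it can be fetched from another notebook while training is still running.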

Could you provide code for doing that in a Jupyter notebook?

In a notebook cell, write (a checkpoint is a folder, hence --recursive):

! aws s3 cp <s3 URI of a checkpoint> <folder in local machine> --recursive

This is done at the bottom of this PyTorch SageMaker sample, which I created just yesterday.


with boto3:

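import boto3

s3 = boto3.resource("s3")

# Download a single checkpoint file from S3 to the notebook's local disk
s3.meta.client.download_file(
    Bucket=<s3 bucket>,
    Key=<object key ; the S3 URI without the s3://<bucket>/ part>,
    Filename=<how you want to name the file locally>)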
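Since download_file fetches one object at a time and a checkpoint is a folder with several files, a whole checkpoint can be pulled by listing the prefix first. This is only a sketch: `local_path_for` and `download_prefix` are hypothetical helper names, and the bucket, prefix and destination folder are yours to fill in.

```python
import os


def local_path_for(key, prefix, dest_dir):
    """Map an S3 object key under prefix to a file path under dest_dir."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(dest_dir, *relative.split("/"))


def download_prefix(bucket, prefix, dest_dir):
    """Download every object under s3://bucket/prefix into dest_dir."""
    # Imported lazily so the path helper above can be used without boto3
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" marker objects
            target = local_path_for(key, prefix, dest_dir)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(Bucket=bucket, Key=key, Filename=target)
```

For example, download_prefix("<s3 bucket>", "<prefix>/checkpoint-6000", "checkpoint-6000") would recreate the checkpoint folder locally, after which it can be loaded with from_pretrained("checkpoint-6000").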