How to access /opt/ml/model before the end of the model training?

pierreguillou · December 9, 2021, 3:38pm

Hi,

In the hyperparameters of my notebook (AWS SageMaker with the HF Training DLC), I defined:
'output_dir': '/opt/ml/model'

In the CloudWatch logs, I can see for example:
Model weights saved in /opt/ml/model/checkpoint-6000/pytorch_model.bin

I would like to test this checkpoint-6000 in another notebook without waiting for the end of the model training.

How can I access the content of /opt/ml/model/checkpoint-6000/ from the AWS site or from another notebook? (I do not want to use the AWS CLI in a terminal.)

Thanks.

OlivierCR · December 9, 2021, 4:00pm

Hi @pierreguillou, the content of /opt/ml/model is accessible only at the end of training (and it will be compressed into a model.tar.gz, which will be enormous if you save dozens or hundreds of transformers checkpoints there). If you want to export models to S3 during training, without interruption, in the same file format as saved locally, you need to save them in /opt/ml/checkpoints and specify the S3 sync location in your SDK call with the parameter checkpoint_s3_uri. Then use the AWS CLI or boto3 download_file to bring them from S3 back to your notebook.

See the doc of the checkpoint feature here, which is in my opinion one of the best features of SageMaker.
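
For reference, the setup described above could look roughly like the sketch below. It only builds the keyword arguments you would pass to the HuggingFace estimator of the SageMaker Python SDK. The bucket, job prefix, role, entry point, instance type and framework versions are placeholder assumptions, not values from this thread. The parts that matter are output_dir pointing at /opt/ml/checkpoints and checkpoint_s3_uri pointing at the S3 location to sync to.

```python
# Sketch: arguments for a SageMaker HuggingFace estimator with checkpoint sync.
# Everything marked "assumption" is a placeholder, not a value from this thread.

CHECKPOINT_LOCAL_PATH = "/opt/ml/checkpoints"  # local folder that SageMaker syncs to S3


def build_estimator_kwargs(bucket, job_prefix, role):
    """Return keyword arguments for sagemaker.huggingface.HuggingFace."""
    return {
        "entry_point": "train.py",         # assumption: name of your training script
        "role": role,                      # assumption: your SageMaker execution role
        "instance_type": "ml.p3.2xlarge",  # assumption
        "instance_count": 1,
        # assumption: use a version combination supported by the HF Training DLC
        "transformers_version": "4.12",
        "pytorch_version": "1.9",
        "py_version": "py38",
        # S3 location kept in sync with CHECKPOINT_LOCAL_PATH while training runs
        "checkpoint_s3_uri": f"s3://{bucket}/{job_prefix}/checkpoints",
        "checkpoint_local_path": CHECKPOINT_LOCAL_PATH,
        "hyperparameters": {
            # the Trainer then writes its checkpoint-XXXX folders here
            # instead of /opt/ml/model
            "output_dir": CHECKPOINT_LOCAL_PATH,
        },
    }


if __name__ == "__main__":
    kwargs = build_estimator_kwargs("my-bucket", "my-training-job", "my-sagemaker-role")
    print(kwargs["checkpoint_s3_uri"])
    # In a SageMaker notebook you would then do something like:
    #   from sagemaker.huggingface import HuggingFace
    #   HuggingFace(**kwargs).fit()
```

With this in place, each checkpoint folder written during training should show up under the checkpoint_s3_uri prefix while the job is still running.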

pierreguillou · December 9, 2021, 6:47pm

Could you provide some code for doing that in a Jupyter notebook?

OlivierCR · December 9, 2021, 7:48pm

In a notebook cell, write the following (add --recursive to copy a whole checkpoint folder such as checkpoint-6000/):

! aws s3 cp <s3 URI of a checkpoint> <folder in local machine>

As done here, at the bottom of this PyTorch SageMaker sample that I created just yesterday.

OlivierCR · December 9, 2021, 7:50pm

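With boto3:

import boto3

s3 = boto3.resource("s3")

# Placeholders: replace the three values below with your own.
s3.meta.client.download_file(
    Bucket="<s3 bucket>",
    Key="<object key: the S3 URI path after the bucket name>",
    Filename="<how you want to name the file locally>",
)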
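
Since download_file fetches one object at a time and a transformers checkpoint is a folder of several files, here is a minimal sketch of how the whole checkpoint-6000/ prefix might be pulled into a notebook. The helper names (local_path_for, download_prefix) and the example bucket and prefix are hypothetical. It assumes the checkpoints were synced to S3 through checkpoint_s3_uri, and uses only the boto3 calls get_paginator("list_objects_v2") and download_file.

```python
import os


def local_path_for(key, prefix, dest_dir):
    """Map an S3 object key under `prefix` to a path inside `dest_dir`."""
    relative = key[len(prefix):].lstrip("/")
    if not relative:  # prefix points at a single object, keep its file name
        relative = os.path.basename(key)
    return os.path.join(dest_dir, *relative.split("/"))


def download_prefix(bucket, prefix, dest_dir, client=None):
    """Download every object under s3://bucket/prefix into dest_dir."""
    if client is None:
        import boto3  # imported lazily so the helper above works without boto3
        client = boto3.client("s3")
    downloaded = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip zero-byte "folder" marker objects
                continue
            target = local_path_for(key, prefix, dest_dir)
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            client.download_file(Bucket=bucket, Key=key, Filename=target)
            downloaded.append(target)
    return downloaded


# Hypothetical usage in a notebook cell (bucket and prefix are placeholders):
# download_prefix("<s3 bucket>", "<job prefix>/checkpoints/checkpoint-6000", "checkpoint-6000")
```

After the call returns, the local checkpoint-6000 folder should contain the same files that the Trainer wrote on the training instance, so it can be loaded like any local checkpoint folder.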
