Training model file too large and fail to deploy

philschmid · October 11, 2021, 11:39am

Yes, I guess your artifact is so big because all the checkpoints saved during training are included in it. You can either change the checkpoint saving strategy in your train.py or change the location where the checkpoints are saved.
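For the first option, here is a rough sketch of what that could look like in train.py. It assumes you train with the transformers Trainer and that SageMaker packs everything under SM_MODEL_DIR (usually /opt/ml/model) into model.tar.gz. The helper name `checkpoint_settings` and the /opt/ml/checkpoints path are my own choices for illustration, not something taken from your script.

```python
import os

# SageMaker tars up the contents of this directory into model.tar.gz.
MODEL_DIR = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

# Assumption: any directory outside MODEL_DIR works; this path is illustrative.
CHECKPOINT_DIR = "/opt/ml/checkpoints"


def checkpoint_settings(keep_checkpoints=False):
    """Return TrainingArguments keyword arguments that keep checkpoints
    out of the model artifact (hypothetical helper)."""
    if not keep_checkpoints:
        # Do not write intermediate checkpoints at all.
        return {"output_dir": CHECKPOINT_DIR, "save_strategy": "no"}
    # Keep only the most recent checkpoint, and keep it outside MODEL_DIR.
    return {
        "output_dir": CHECKPOINT_DIR,
        "save_strategy": "epoch",
        "save_total_limit": 1,
    }


# In train.py (requires transformers; model and train_dataset are yours):
#   from transformers import Trainer, TrainingArguments
#   args = TrainingArguments(**checkpoint_settings())
#   trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
#   trainer.train()
#   trainer.save_model(MODEL_DIR)  # only the final model lands in model.tar.gz
```

With this, only what `trainer.save_model(MODEL_DIR)` writes ends up in the artifact.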
Or you could download your model.tar.gz, remove all checkpoints from it, and then upload it to S3 again. Documentation here: Deploy models to Amazon SageMaker
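If you go the repacking route, something like the following should work once model.tar.gz is downloaded locally. It is only a sketch: it assumes the checkpoints live in directories named `checkpoint-<step>` (the Trainer default), and `strip_checkpoints` is a name I made up.

```python
import tarfile


def strip_checkpoints(src_path, dst_path):
    """Copy a model archive to dst_path, skipping checkpoint-* directories.

    Returns the names of the members that were kept.
    """
    kept = []
    with tarfile.open(src_path, "r:gz") as src, \
            tarfile.open(dst_path, "w:gz") as dst:
        for member in src.getmembers():
            # Skip anything that is, or lives inside, a checkpoint-* directory.
            if any(part.startswith("checkpoint-")
                   for part in member.name.split("/")):
                continue
            # Regular files need their bytes copied; directories do not.
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
            kept.append(member.name)
    return kept


# Example (file names are placeholders):
#   strip_checkpoints("model.tar.gz", "model-small.tar.gz")
```

Afterwards upload the smaller archive to S3 and point your deployment at it.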
Another solution would be to upload your model to the Hugging Face Hub (Models - Hugging Face) and then deploy it using HF_MODEL_ID and HF_TASK.
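A minimal sketch of the Hub-based deployment with the SageMaker Python SDK is below. The container versions and the instance type are assumptions on my side, so pick a combination that your SDK version supports, and pass your own model id, task, and IAM role. The two helper functions are just for illustration.

```python
def hub_env(model_id, task):
    """Environment variables that tell the Hugging Face inference container
    which Hub model to load and which pipeline task to serve."""
    return {"HF_MODEL_ID": model_id, "HF_TASK": task}


def deploy_from_hub(model_id, task, role, instance_type="ml.m5.xlarge"):
    """Deploy a Hub model to a SageMaker endpoint (hypothetical helper).

    Requires the sagemaker SDK and AWS credentials; nothing is created
    until this function is called.
    """
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        env=hub_env(model_id, task),
        role=role,
        # Assumed versions; adjust to what your SDK and region offer.
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


# Example (placeholders):
#   predictor = deploy_from_hub("your-user/your-model", "text-classification", role)
#   predictor.predict({"inputs": "I like this."})
```

No model.tar.gz is involved here, so the artifact size problem does not come up.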
