Trained model file too large and fails to deploy

hi, community!

I have trained a BERT classification model on SageMaker with the config below:

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 5,
                 'train_batch_size': 4,
                 'model_name':'bert-base-uncased',
                 'num_labels':31,
                 }

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    role=role,
    # volume size (GB) for the training instance
    volume_size=200,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

This works perfectly. However, I have two problems:

  1. The generated model file is huge: 64 GB. Why is that?
  2. I can't deploy an endpoint successfully, which I'm guessing is because the model is too big. Any ideas on how to solve this?

Hey @jackieliu930,

Yes, I guess your artifact is so big because all the checkpoints saved during training are included in it. You can either change the checkpoint saving strategy in your train.py or change the location where the checkpoints are saved.
Or you could download your model.tar.gz, remove all checkpoints from it, and upload it to S3 again (a rough sketch of that is below). Documentation here: Deploy models to Amazon SageMaker
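
If you go the cleanup route, it could look roughly like this. This is only a sketch: the bucket and key names are placeholders, and it assumes the checkpoints sit in top-level checkpoint-* folders inside the archive.

import os
import shutil
import tarfile

import boto3

# placeholder bucket/key -- replace with your own values
bucket = 'my-sagemaker-bucket'
key = 'path/to/model.tar.gz'

s3 = boto3.client('s3')
s3.download_file(bucket, key, 'model.tar.gz')

# unpack the artifact, drop the checkpoint-* folders, repack the rest
with tarfile.open('model.tar.gz') as tar:
    tar.extractall('model')

for name in os.listdir('model'):
    if name.startswith('checkpoint-'):
        shutil.rmtree(os.path.join('model', name))

with tarfile.open('model-clean.tar.gz', 'w:gz') as tar:
    for name in os.listdir('model'):
        tar.add(os.path.join('model', name), arcname=name)

s3.upload_file('model-clean.tar.gz', bucket, 'path/to/model-clean.tar.gz')
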
Another solution would be to upload your model to the Hub (Models - Hugging Face) and then deploy it using HF_MODEL_ID and HF_TASK.
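
For the Hub route, the deployment could look roughly like this (a sketch: the repo id and instance type are placeholders, and role is the same execution role used for training above):

from sagemaker.huggingface import HuggingFaceModel

# hypothetical Hub repo id -- replace with the model you pushed
hub = {
    'HF_MODEL_ID': 'jackieliu930/bert-base-uncased-classifier',
    'HF_TASK': 'text-classification',
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',  # pick an instance type that fits your latency/cost needs
)

predictor.predict({'inputs': 'This is a test sentence.'})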

Hey @philschmid, I am using run_glue.py to fine-tune a model and it generates a 150+ GB model.tar.gz.
How do I change the location where the checkpoints are saved?

You can adjust the save directory for your checkpoints by setting output_dir to a different path and then only saving the last checkpoint into /opt/ml/model.
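
For example, the relevant part of train.py could look roughly like this (a sketch only; model, tokenizer, and the datasets are assumed to be defined elsewhere in your script):

import os

from transformers import Trainer, TrainingArguments

# write intermediate checkpoints to a scratch directory instead of
# /opt/ml/model, so they don't end up in model.tar.gz
training_args = TrainingArguments(
    output_dir='/tmp/checkpoints',
    num_train_epochs=5,
    per_device_train_batch_size=4,
    save_total_limit=2,  # optional: also cap how many checkpoints are kept
)

trainer = Trainer(
    model=model,                  # model/datasets come from the rest of train.py
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()

# SM_MODEL_DIR is /opt/ml/model, which SageMaker packs into model.tar.gz;
# only the final weights and tokenizer are saved there
trainer.save_model(os.environ['SM_MODEL_DIR'])
tokenizer.save_pretrained(os.environ['SM_MODEL_DIR'])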