Trained model file too large and fails to deploy

hi, community!

I have trained a BERT classification model on SageMaker with the config below:

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 5,
                 'train_batch_size': 4,
                 'model_name':'bert-base-uncased',
                 'num_labels':31,
                 }

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    role=role,
    # volume size (GB) for the training instance
    volume_size=200,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

This works perfectly. However, I have two problems:

  1. The generated model file is huge: 64 GB. Why is that?
  2. I can't deploy an endpoint successfully, which I'm guessing is because the model is too big. Any ideas on how to solve this?

Hey @jackieliu930,

Yes, I guess your artifact is so big because all the checkpoints saved during training are included in it. You can either change the checkpoint saving strategy in your train.py or change the location where the checkpoints are saved.
Or you could download your model.tar.gz, remove all checkpoints from it, and upload it to S3 again (a rough sketch of that is below). Documentation here: Deploy models to Amazon SageMaker
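
If you go the cleanup route, it could look roughly like this. This is only a sketch: the bucket and key names are placeholders, and it assumes the checkpoints sit in top-level checkpoint-* folders inside the archive.

import os
import shutil
import tarfile

import boto3

# placeholder bucket/key -- replace with your own values
bucket = 'my-sagemaker-bucket'
key = 'path/to/model.tar.gz'

s3 = boto3.client('s3')
s3.download_file(bucket, key, 'model.tar.gz')

# unpack the artifact, drop the checkpoint-* folders, repack the rest
with tarfile.open('model.tar.gz') as tar:
    tar.extractall('model')

for name in os.listdir('model'):
    if name.startswith('checkpoint-'):
        shutil.rmtree(os.path.join('model', name))

with tarfile.open('model-clean.tar.gz', 'w:gz') as tar:
    for name in os.listdir('model'):
        tar.add(os.path.join('model', name), arcname=name)

s3.upload_file('model-clean.tar.gz', bucket, 'path/to/model-clean.tar.gz')
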
Another solution would be to upload your model to the Hub (Models - Hugging Face) and then deploy it using HF_MODEL_ID and HF_TASK.
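
For the Hub route, the deployment could look roughly like this (a sketch: the repo id and instance type are placeholders, and role is the same execution role used for training above):

from sagemaker.huggingface import HuggingFaceModel

# hypothetical Hub repo id -- replace with the model you pushed
hub = {
    'HF_MODEL_ID': 'jackieliu930/bert-base-uncased-classifier',
    'HF_TASK': 'text-classification',
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',  # pick an instance type that fits your latency/cost needs
)

predictor.predict({'inputs': 'This is a test sentence.'})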

Hey @philschmid, I am using run_glue.py to fine-tune a model and it generates a 150+ GB model.tar.gz.
How do I change the location where the checkpoints are saved?

You can adjust the save directory for your checkpoints by setting output_dir to a different path and then only saving the last checkpoint into /opt/ml/model.
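
For example, the relevant part of train.py could look roughly like this (a sketch only; model, tokenizer, and the datasets are assumed to be defined elsewhere in your script):

import os

from transformers import Trainer, TrainingArguments

# write intermediate checkpoints to a scratch directory instead of
# /opt/ml/model, so they don't end up in model.tar.gz
training_args = TrainingArguments(
    output_dir='/tmp/checkpoints',
    num_train_epochs=5,
    per_device_train_batch_size=4,
    save_total_limit=2,  # optional: also cap how many checkpoints are kept
)

trainer = Trainer(
    model=model,                  # model/datasets come from the rest of train.py
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()

# SM_MODEL_DIR is /opt/ml/model, which SageMaker packs into model.tar.gz;
# only the final weights and tokenizer are saved there
trainer.save_model(os.environ['SM_MODEL_DIR'])
tokenizer.save_pretrained(os.environ['SM_MODEL_DIR'])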