Hi, community!
I have trained a BERT classification model on SageMaker with the config below:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 5,
    'train_batch_size': 4,
    'model_name': 'bert-base-uncased',
    'num_labels': 31,
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    role=role,
    # add volume size set up
    volume_size=200,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
)

# starting the training job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
This works perfectly. However, I have two problems:
- The generated model file (model.tar.gz) is huge: 64 GB. Why is that? (My current suspicion is sketched below.)
- I can't deploy an endpoint successfully; I'm guessing the model size is too big. Any idea how to solve this? (The deploy call I'm attempting is also sketched below.)
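On the first problem, my suspicion is that train.py (via the Trainer API) writes intermediate checkpoints into the SageMaker model directory, so every checkpoint gets packed into model.tar.gz. This is only a guess, and the argument values below are assumptions, not necessarily what my script does; a minimal sketch of how I understand the saving could be constrained:

import os
from transformers import TrainingArguments

# Sketch only: if checkpoints land in /opt/ml/model (SM_MODEL_DIR), SageMaker
# packs all of them into model.tar.gz. Writing checkpoints elsewhere and
# capping their number should keep the final artifact small.
training_args = TrainingArguments(
    output_dir='/opt/ml/checkpoints',  # assumption: keep checkpoints out of SM_MODEL_DIR
    save_total_limit=1,                # assumption: keep only the most recent checkpoint
    num_train_epochs=5,
    per_device_train_batch_size=4,
)
# The final model would then be saved explicitly to the model dir, e.g.
# trainer.save_model(os.environ['SM_MODEL_DIR']), so only it gets packaged.

Is that the right direction, or is something else inflating the artifact?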
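For the second problem, the deploy call I'm attempting looks roughly like this (a minimal sketch; the instance type is just what I picked and may itself be part of the problem given the artifact size):

# Sketch of the deploy call; instance choice is an assumption on my side.
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
)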