I’m using HuggingFace/SageMaker to fine-tune a DistilBERT model. Lately I’ve been running into an issue where training/evaluation reaches 100% and finishes without any errors, the job then gets stuck for a few hours during the model upload step, and finally fails with the following error message:
I didn’t see any indication of why the job would fail in the logs I have access to (training/eval fully finishes, there are no CUDA/memory issues, no oddities in the data, etc.), and AWS support doesn’t seem to have a clue either.
This issue doesn’t happen when the model trains on a subset of the available training data (e.g. 30-50% of it); it only seems to occur when training with the full dataset - same model, same config, same instances, etc. So at first I thought it had to do with S3 checkpoints and distributed training, since this only happens when training on our larger dataset.
I’m using two ml.p4d.24xlarge instances with distributed training for this job. I did see that AWS has a troubleshooting document for model-parallel training and tried their suggestions of removing the debugger hook config and disabling checkpointing, but no luck there either.
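For reference, the distribution argument I’m passing to the estimator is essentially the standard SageMaker data parallel setting; the snippet below is a simplified sketch rather than my exact dict:

# enables SageMaker distributed data parallel across the two ml.p4d.24xlarge instances
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}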
Here’s my estimator config, just in case:
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type=instance_type,
    instance_count=instance_count,
    base_job_name='test-run-no-debug-no-checkpoints',
    # checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
    # use_spot_instances=True,
    # max_wait=(2*24*60*60),
    max_run=(2*24*60*60),
    volume_size=volume_size,
    role=role,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    distribution=distribution,
    debugger_hook_config=False,
)
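For completeness, the end of my train.py follows the usual pattern of saving to SM_MODEL_DIR; whatever lands in that directory is what SageMaker tars up and uploads to S3 at the end of the job, which is the step where it hangs. Roughly (simplified sketch, with trainer and tokenizer set up earlier in the script):

import os

# SageMaker tars everything in SM_MODEL_DIR (/opt/ml/model) and uploads it to S3 after training
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
trainer.save_model(model_dir)         # writes the fine-tuned DistilBERT weights + config
tokenizer.save_pretrained(model_dir)  # writes the tokenizer files alongside the model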
I’m not sure what’s causing this issue. Does anyone have any insight into what might be going on?