InternalServerError after model training finishes, but fails to upload?

I’m using Hugging Face on SageMaker to fine-tune a DistilBERT model. Lately I’ve been running into an issue where model training/evaluation finishes 100% without any errors, the job gets stuck for a few hours during the model upload step, and then fails with the following error message:

I didn’t see any indication of why the job would fail in the logs I have access to (training/eval fully finishes, no CUDA/memory issues, no oddities in the data, etc.), and AWS support doesn’t seem to have a clue either.


This issue doesn’t happen when the model trains on a subset of the available training data (e.g. using 30-50% of it) and only seems to occur when training with all of the available data - same model, same config, same instances, etc. So at first I thought it had to do with S3 checkpoints and distributed training, since this only happens when training on our larger dataset.

I’m using 2x ml.p4d.24xlarge instances with distributed training for this job. I did see that AWS has a document on model parallel troubleshooting and I’ve tried their suggestions of removing the debugger hook config and disabling checkpointing, but no luck either.

Here’s my estimator config, just in case:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type=instance_type,
    instance_count=instance_count,
    base_job_name='test-run-no-debug-no-checkpoints',
    # checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
    # use_spot_instances=True,
    # max_wait=(2*24*60*60),
    max_run=(2*24*60*60),
    volume_size=volume_size,
    role=role,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    distribution=distribution,
    debugger_hook_config=False,
)

I’m not sure what’s causing this issue - does anyone have any insight into it?

Hello @nreamaroon,

thank you for opening the thread! That is indeed strange.
Did you use your own train.py or one of the existing examples? If you wrote your own, can you share your training script?
In particular, which saving strategy do you use? And could you also share the size of the dataset, the model you use, and the hyperparameters?

The only plausible explanation I can think of right now is that, when training on the large dataset, SageMaker creates so many checkpoints that it somehow fails to upload them at the end. But that wouldn’t make much sense at all.
An easy way to test this would be to save only the final model to /opt/ml/model.
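For example, something along these lines in your train.py - just a sketch assuming you use the Trainer API, so adapt the values to your script:

import os
from transformers import TrainingArguments

# Sketch: disable intermediate checkpointing entirely, so nothing piles up
# in the output directory during training. SageMaker exposes /opt/ml/model
# via the SM_MODEL_DIR environment variable and uploads only that directory
# at the end of the job.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

training_args = TrainingArguments(
    output_dir=model_dir,
    save_strategy="no",  # no per-step or per-epoch checkpoints
    num_train_epochs=2,
    per_device_train_batch_size=32,
)

# ... build the Trainer as before, call trainer.train(), and then:
# trainer.save_model(model_dir)  # writes only the final weights + config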

In addition to this, I have shared your issue with the AWS team directly!

@nreamaroon I got a response from AWS. They asked if you can provide the full CloudWatch logs.

They said it looks like it’s a data agent issue.

Thanks for following up!

I’m using my own train.py based on this example. It’s been modified from the example to replicate a setup I’d been using before moving to SageMaker, specifically:

  1. Use the DistilBertTokenizer from a custom vocab file (instead of AutoTokenizer)

  2. Include DataCollatorForLanguageModeling for fine-tuning a masked language model

  3. Replicate previous DistilBertConfig parameters (instead of AutoModelForMaskedLM)

However, beyond that it largely follows the format of the example - so the saving strategy should be the same default, which is presumably save_strategy="steps". (I’ve sketched roughly how these modified pieces fit together below, after the config.) Is there a way for me to attach my train.py instead of just pasting it in plain text on here?

Anyway, the tokenized dataset is around 13 GB in total - processed in a separate step and saved as an .arrow file prior to submitting the training job.
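Inside train.py it gets loaded back roughly like this (a sketch - the path and the "train" channel name are simplified placeholders):

import os
from datasets import load_from_disk

# Sketch: the pre-tokenized dataset was written earlier with
# Dataset.save_to_disk() and is read back from the SageMaker input channel
# (the "train" channel name here is an assumption).
train_dataset = load_from_disk(os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))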

Here are the hyperparameters I’m using:

hyperparameters={'model_name':'distilbert-base-uncased',
                 'epochs': 2,
                 'train_batch_size': 32, # 32 with ml.p4d.24xlarge
                 'eval_batch_size': 32,
                 }

And here is the model config:

config = DistilBertConfig(
    vocab_size = tokenizer.vocab_size, 
    max_position_embeddings=2048, # 512 default
    sinusoidal_pos_embds=False, 
    n_layers=6, 
    n_heads=12, 
    dim=768, 
    hidden_dim=3072, 
    dropout=0.1, 
    attention_dropout=0.1, 
    activation='gelu', 
    initializer_range=0.02, 
    qa_dropout=0.1, 
    seq_classif_dropout=0.2, 
    pad_token_id=0
    )
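And here, in rough outline, is how those modified pieces fit together in train.py (the vocab path is a placeholder and the masking probability is just the library default, not necessarily what I use):

from transformers import (
    DataCollatorForLanguageModeling,
    DistilBertConfig,
    DistilBertForMaskedLM,
    DistilBertTokenizer,
)

# 1. Tokenizer built from a custom vocab file (placeholder path)
tokenizer = DistilBertTokenizer("custom-vocab.txt")

# 2. Collator that applies random masking for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# 3. Model built from the custom DistilBertConfig instead of AutoModelForMaskedLM
config = DistilBertConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=2048,
    # ... plus the remaining parameters shown above
)
model = DistilBertForMaskedLM(config)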

So based on what you’re saying, I’ll try changing the saving strategy to save_strategy="epoch" to reduce the number of checkpoints. To save just the last model to /opt/ml/model, is it as simple as passing save_total_limit=1 in the training arguments?
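In other words, something like this in the training arguments (just a sketch of what I’m planning to try):

from transformers import TrainingArguments

# Sketch: checkpoint once per epoch and keep at most one checkpoint on disk,
# so far fewer files end up in the output directory at upload time.
training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    save_strategy="epoch",
    save_total_limit=1,
    num_train_epochs=2,
    per_device_train_batch_size=32,
)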

I do have the CloudWatch logs, but I probably can’t paste them here since they’d be way too long. Is there a way I can send them directly to you or the AWS team you’re in contact with? (And if it’s indeed a problem with the data agent, do you have any idea how I can resolve it?)

Thanks so much for helping!

Hey @nreamaroon,

Yes, using save_total_limit also works if it’s okay for you to keep only 1 additional checkpoint.
Sadly you cannot upload files here, but you could either create a repository on Models - Hugging Face and upload it there manually.

Or upload it to your preferred storage provider, e.g. Google Drive, and then share it.