InternalServerError after model training finishes, but fails to upload?

nreamaroon · August 27, 2021, 6:05pm

I’m using HuggingFace/SageMaker to fine-tune a distilBert model. Lately, I’ve been running into an issue where model training/evaluation finishes 100% without any errors, gets stuck for a few hours during the model uploading process, and then the job fails with the following error message:

I didn’t see any indication of why the job would fail in the logs I have access to (e.g. training/eval fully finishes, no issues with CUDA/memory, oddities in the data, etc) and AWS support doesn’t seem to have a clue either.

This issue doesn’t happen when the model trains on a subset of the available training data (e.g. using 30-50% of the available training data to train) and only seems to occur when training with all the available training data - same model, same config, same instances, etc. So at first, I thought it had to do with S3 checkpoints and distributed training since this only happens when training on our larger dataset.

I’m using 2x of the ml.p4d.24xlarge instances with distributed training for this job. I did see that AWS had a document on model parallel troubleshooting and have tried their suggestions of removing the debug hook config + disabling checkpointing but no luck either.

Here’s my estimator config, just in case:

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type=instance_type,
                            instance_count=instance_count,
                            base_job_name='test-run-no-debug-no-checkpoints',
                            # checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
                            # use_spot_instances=True,
                            # max_wait=(2*24*60*60),
                            max_run=(2*24*60*60),
                            volume_size=volume_size,
                            role=role,
                            transformers_version='4.6.1',
                            pytorch_version='1.7.1',
                            py_version='py36',
                            hyperparameters = hyperparameters,
                            distribution = distribution,
                            debugger_hook_config=False)

I’m not sure what’s causing this issue and was wondering if anyone has any insight about this?

philschmid · August 30, 2021, 7:36am

Hello @nreamaroon,

thank you for opening the thread! That is indeed strange.
Did you use your own train.py or already created an example? If you wrote your own one can you share your training script?
Especially, which saving strategy you use. And could you also share the size of the dataset and model you use and the hyperparamter?

The only understandable reason for me currently is that during training on the large dataset SageMaker is creating this many checkpoints that it somehow fails uploading at the end. But this wouldn’t make any sense at all.
An easy way to test this would be only saving the last model in opt/ml/model.

In Addition to this, I have shared your issue with the AWS Team directly!

philschmid · August 30, 2021, 11:27am

@nreamaroon I got a response from AWS. They asked if you can provide the full cloudwatch logs?

They said it looks like it’s a data agent issue.

nreamaroon · August 30, 2021, 10:09pm

Thanks for following up!

I’m using my own train.py based on this example. It’s been modified from the example to replicate a setup I’ve been using before moving to SageMaker, specifically:

Use the DistilBertTokenizer from a custom vocab file (instead of AutoTokenizer)
Include DataCollatorForLanguageModeling for fine-tuning a masked language model
Replicate previous DistilBertConfig parameters (instead of AutoModelForMaskedLM)

However, beyond that, it largely follows the format from the example - so the saving strategy should be the same/default, which is presumably save_strategy="steps". Is there a way for me to to attach my train.py instead of just pasting it in plain text on here?

Anyway, the tokenized dataset is around 13 GB in total - processed in a separate step and saved as an .arrow file prior to submitting the training job.

Here are the hyper parameters I’m using:

hyperparameters={'model_name':'distilbert-base-uncased',
                 'epochs': 2,
                 'train_batch_size': 32, # 32 with ml.p4d.24xlarge
                 'eval_batch_size': 32,
                 }

And here is the model config:

config = DistilBertConfig(
    vocab_size = tokenizer.vocab_size, 
    max_position_embeddings=2048, # 512 default
    sinusoidal_pos_embds=False, 
    n_layers=6, 
    n_heads=12, 
    dim=768, 
    hidden_dim=3072, 
    dropout=0.1, 
    attention_dropout=0.1, 
    activation='gelu', 
    initializer_range=0.02, 
    qa_dropout=0.1, 
    seq_classif_dropout=0.2, 
    pad_token_id=0
    )

So based on what you’re saying, I’ll try changing the saving strategy to save_strategy="epoch" to reduce the number of checkpoints. To save just the last model in opt/ml/model, is it just as simple as passing save_total_limit=1 in the training argument?

I do have the CloudWatch logs, however I probably can’t paste it here since it’d be way too long. Is there a way I can send that directly to you or the AWS team you’re in contact with? (And if it’s indeed a problem with the data agent, do you have any idea on how I can resolve the problem?)

Thanks so much for helping!

philschmid · August 31, 2021, 6:53am

Hey @nreamaroon,

Yes, using save_total_limit also works if it is okay for you to only have 1 additional checkpoint.
Sadly you cannot upload files here, but you could either create a repository on Models - Hugging Face and upload it manually.

Or upload it to your preferred storage provider, e.g. gdrive, and then share it.

Topic		Replies	Views
InternalServerException when running a model loaded on S3 Amazon SageMaker	4	984	August 6, 2021
Models is not saved in S3 bucket location Amazon SageMaker	0	307	April 9, 2024
Training model file too large and fail to deploy Amazon SageMaker	3	1377	July 3, 2023
ClientError: Artifact upload failed:Error 5 Amazon SageMaker	6	2403	December 1, 2021
Getting error in the inference stage of Transformers Model (Hugging Face) 🤗Transformers	0	782	October 11, 2022

InternalServerError after model training finishes, but fails to upload?

Related topics