Error when saving model in accelerate

Accelerate: 0.9.0
Python: 3.9.7
I have written the following class method to save a model trained with Accelerate:

def save_model(self, model_to_save: torch.nn.Module, model_save_path: str):
    if self.is_accelerate and self.accelerator.is_local_main_process:
        self.accelerator.wait_for_everyone()
        unwrapped_model = self.accelerator.unwrap_model(model_to_save)
        self.accelerator.save(unwrapped_model.state_dict(), model_save_path)
    else:
        torch.save(model_to_save.state_dict(), model_save_path)

However, when I run the training and try to save off the model, I get the following error:

wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
   EPOCH 1/1:  25%|█████████████████████████████▊                    | 10608/42381 [05:10<14:50, 35.67it/s]
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-286nvn1v
wandb: Find logs at: ./wandb/offline-run-20220705_111956-286nvn1v/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-880zdcay
wandb: Find logs at: ./wandb/offline-run-20220705_111956-880zdcay/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-3v2m38kt
wandb: Find logs at: ./wandb/offline-run-20220705_111956-3v2m38kt/logs
wandb: 
wandb: Run history:
wandb:  accuracy ▁
wandb:        f1 ▁
wandb:      loss █▆▃▂▃▁▁▁▂▁▁▂▂▂▃▂▁▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▂▂▂▂▂
wandb: precision ▁
wandb:    recall ▁
wandb: 
wandb: Run summary:
wandb:  accuracy 0.00536
wandb:        f1 0.00052
wandb:      loss 82.67569
wandb: precision 0.00028
wandb:    recall 0.00723
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-1b3t18jh
wandb: Find logs at: ./wandb/offline-run-20220705_111956-1b3t18jh/logs
INFO: WandB run closed
INFO: eval time = 13.744579076766968 seconds
INFO: Finished eval
INFO: ----------------------------------------------------------------------------------------------------
INFO: SAVING MODEL TO ./models/full_model


[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2769, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806079 milliseconds before timing out.
INFO: SAVED MODEL TO ./models/full_model/rffp_encoder_t2v_full_model.pt
----------------------------------------------------------------------------------------------------
INFO: SAVING EXPERIMENTS DF
INFO: fields recorded for experiment: ['list_file_path', 'data_config_path', 'data_preprocess_method', 'n_examples_tot', 'train_size', 'test_size', 'n_train', 'n_test', 'n_labels', 'size_of_dataset', 'model_config_path', 'model_name', 'd_input', 't2v_activation_func', 'embedding_size', 'n_self_attention_heads', 'n_encoder_blocks', 'n_params', 'dropout_p', 'activation', 'loss_fn', 'model_dump', 'trainer_config_path', 'batch_size', 'n_epochs', 'n_training_steps', 'optimizer', 'learning_rate', 'lr_scheduler_type', 'lr_n_warmup_steps', 'train_time', 'test_accuracy', 'test_f1', 'test_recall', 'test_precision', 'output_dir', 'model_save_name']
----------------------------------------------------------------------------------------------------
INFO: YO DONE!!!!!
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2769, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806079 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 3460319) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
run_training.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-07-05_11:56:01
  host      : silver-surfer.airlab.com
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 3460319)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3460319
========================================================
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 528, in launch_command
    multi_gpu_launcher(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 279, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'run_training.py']' returned non-zero exit status 1.

I’m not sure how to interpret the error. Does anyone have any advice about what I might’ve messed up? Thanks in advance for your help!

Here’s what you should do instead to make sure it all works reliably. This mirrors what Accelerator.save_state currently does as well:

def save_model(self, model_to_save: torch.nn.Module, model_save_path: str):
    # get_state_dict unwraps the model for you and returns its state dict
    state = self.accelerator.get_state_dict(model_to_save)
    self.accelerator.save(state, model_save_path)

Can you try this and see if it fixes your problem?
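For reference, here is a minimal standalone sketch of the same pattern outside of a class; the tiny Linear model and the "model.pt" path are just placeholders for your real model and output file. Every process runs these lines:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(8, 2))  # stand-in for the real model

# ... training loop ...

# Every process calls these two lines: get_state_dict unwraps the model and
# returns its state dict, and accelerator.save only writes the file from the
# main process instead of once per process.
state = accelerator.get_state_dict(model)
accelerator.save(state, "model.pt")  # placeholder path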


If I have Accelerate set up to run on 4 GPUs, will the above solution save a single model, or do I need to tell Accelerate to wait until all the processes are done calculating?

It will save a single model. get_state_dict will handle all of those bits for you, so you don’t have to worry about that.
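If you do want an explicit synchronization point before saving, note that wait_for_everyone() is a barrier that every process has to reach; calling it only inside an is_local_main_process check, as in the original snippet, likely leaves the main process blocked waiting on ranks that never hit the barrier, which is the kind of hang the NCCL watchdog timeout above reports. A sketch, reusing the placeholder names from the example above:

# Optional explicit barrier before saving -- called on *every* process,
# not just the main one:
accelerator.wait_for_everyone()
state = accelerator.get_state_dict(model)
accelerator.save(state, "model.pt")  # placeholder path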

That worked perfectly. Thanks!