Error when saving model in accelerate

Accelerate: 0.9.0
Python: 3.9.7
I have written the following class method that saves off a model trained when using accelerate:

def save_model(self, model_to_save: torch.nn.Module, model_save_path: str):
    if self.is_accelerate and self.accelerator.is_local_main_process:
        self.accelerator.wait_for_everyone()
        unwrapped_model = self.accelerator.unwrap_model(model_to_save)
        self.accelerator.save(unwrapped_model.state_dict(), model_save_path)
    else:
        torch.save(model_to_save.state_dict(), model_save_path)

However, when I run the training and try to save off the model, I get the following error:

wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
   EPOCH 1/1:  25%|█████████████████████████████▊ | 10608/42381 [05:10<14:50, 35.67it/s]
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-286nvn1v
wandb: Find logs at: ./wandb/offline-run-20220705_111956-286nvn1v/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-880zdcay
wandb: Find logs at: ./wandb/offline-run-20220705_111956-880zdcay/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-3v2m38kt
wandb: Find logs at: ./wandb/offline-run-20220705_111956-3v2m38kt/logs
wandb: 
wandb: Run history:
wandb:  accuracy ▁
wandb:        f1 ▁
wandb:      loss █▆▃▂▃▁▁▁▂▁▁▂▂▂▃▂▁▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▂▂▂▂▂
wandb: precision ▁
wandb:    recall ▁
wandb: 
wandb: Run summary:
wandb:  accuracy 0.00536
wandb:        f1 0.00052
wandb:      loss 82.67569
wandb: precision 0.00028
wandb:    recall 0.00723
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-1b3t18jh
wandb: Find logs at: ./wandb/offline-run-20220705_111956-1b3t18jh/logs
INFO: WandB run closed
INFO: eval time = 13.744579076766968 seconds
INFO: Finished eval
INFO: ----------------------------------------------------------------------------------------------------
INFO: SAVING MODEL TO ./models/full_model


[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2769, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806079 milliseconds before timing out.
INFO: SAVED MODEL TO ./models/full_model/rffp_encoder_t2v_full_model.pt
----------------------------------------------------------------------------------------------------
INFO: SAVING EXPERIMENTS DF
INFO: fields recorded for experiment: ['list_file_path', 'data_config_path', 'data_preprocess_method', 'n_examples_tot', 'train_size', 'test_size', 'n_train', 'n_test', 'n_labels', 'size_of_dataset', 'model_config_path', 'model_name', 'd_input', 't2v_activation_func', 'embedding_size', 'n_self_attention_heads', 'n_encoder_blocks', 'n_params', 'dropout_p', 'activation', 'loss_fn', 'model_dump', 'trainer_config_path', 'batch_size', 'n_epochs', 'n_training_steps', 'optimizer', 'learning_rate', 'lr_scheduler_type', 'lr_n_warmup_steps', 'train_time', 'test_accuracy', 'test_f1', 'test_recall', 'test_precision', 'output_dir', 'model_save_name']
----------------------------------------------------------------------------------------------------
INFO: YO DONE!!!!!
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2769, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806079 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 3460319) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
run_training.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-07-05_11:56:01
  host      : silver-surfer.airlab.com
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 3460319)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3460319
========================================================
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 528, in launch_command
    multi_gpu_launcher(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 279, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'run_training.py']' returned non-zero exit status 1.

I’m not sure how to interpret the error. Does anyone have any advice about what I might’ve messed up? Thanks in advance for your help!

Here’s what you should do instead to make sure it all works well; it’s also how Accelerator.save_state is currently implemented:

def save_model(self, model_to_save, model_save_path):
    state = self.accelerator.get_state_dict(model_to_save)  # this will call unwrap_model for you as well
    self.accelerator.save(state, model_save_path)

Can you try this and see if it fixes your problem?
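
For reference, outside of a class the same pattern looks roughly like this (just a sketch; model and the output path are placeholders):

from accelerate import Accelerator

accelerator = Accelerator()
# ... accelerator.prepare(...), training loop, evaluation, etc. ...

# Gather the (possibly wrapped) model weights and write a single checkpoint.
state = accelerator.get_state_dict(model)
accelerator.save(state, "model.pt")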
If I have Accelerate set up to run on 4 GPUs, will the above solution save off a single model or do I need to indicate that accelerate should wait until all the processes are done calculating?

It will save a single model. get_state_dict will perform all of those bits for you, so you don’t have to worry about that.
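
Concretely, you can reduce the whole method to something like this sketch (keeping your is_accelerate flag and self.accelerator attribute) and have every process call it:

def save_model(self, model_to_save: torch.nn.Module, model_save_path: str):
    if self.is_accelerate:
        # All processes call this; get_state_dict gathers and unwraps the
        # weights, and accelerator.save writes a single checkpoint.
        state = self.accelerator.get_state_dict(model_to_save)
        self.accelerator.save(state, model_save_path)
    else:
        torch.save(model_to_save.state_dict(), model_save_path)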

That worked perfectly. Thanks!

@muellerzr I am using accelerate along with deepspeed within the trl library. I used your method of saving the model, but I still get the NCCL timeout error after a few training steps.

I am training with 4 GPUs on the same machine.


I’ve even tried increasing the timeout with kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=5_400))] as an argument to Accelerator during initialization, but for some reason this doesn’t seem to take effect and the timeout still occurs.
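
For reference, this is roughly how I am constructing the Accelerator (a minimal sketch of my setup; the 5,400-second value is just what I tried, and the deepspeed/trl details are omitted):

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Ask for a 90-minute NCCL timeout instead of the default 30 minutes.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5_400))
accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])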

I am using accelerate 0.16.0 and torch 2.0.0.

Can you please help with this?