Accelerate: 0.9.0
Python: 3.9.7
I have written the following class method that saves a model trained with Accelerate:
def save_model(self, model_to_save: torch.nn.Module, model_save_path: str):
    if self.is_accelerate and self.accelerator.is_local_main_process:
        self.accelerator.wait_for_everyone()
        unwrapped_model = self.accelerator.unwrap_model(model_to_save)
        self.accelerator.save(unwrapped_model.state_dict(), model_save_path)
    else:
        torch.save(model_to_save.state_dict(), model_save_path)
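For context, the surrounding class sets up Accelerate roughly like this (heavily simplified sketch; only self.is_accelerate, self.accelerator, and save_model above come from my actual code, the other names are placeholders):

import torch
from accelerate import Accelerator

class Trainer:
    def __init__(self, model, optimizer, train_dataloader, is_accelerate=True):
        self.is_accelerate = is_accelerate
        if self.is_accelerate:
            # One Accelerator per process; accelerate launch starts 4 processes in my runs.
            self.accelerator = Accelerator()
            # prepare() wraps the model for distributed training and shards the dataloader.
            self.model, self.optimizer, self.train_dataloader = self.accelerator.prepare(
                model, optimizer, train_dataloader
            )
        else:
            self.model = model
            self.optimizer = optimizer
            self.train_dataloader = train_dataloader

    def save_model(self, model_to_save: torch.nn.Module, model_save_path: str):
        ...  # as shown above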
However, when I run training and try to save the model, I get the following error:
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb:
EPOCH 1/1: 25%|█████████████████████████████▊ | 10608/42381 [05:10<14:50, 35.67it/s]wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-286nvn1v
wandb: Find logs at: ./wandb/offline-run-20220705_111956-286nvn1v/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-880zdcay
wandb: Find logs at: ./wandb/offline-run-20220705_111956-880zdcay/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-3v2m38kt
wandb: Find logs at: ./wandb/offline-run-20220705_111956-3v2m38kt/logs
wandb:
wandb: Run history:
wandb: accuracy ▁
wandb: f1 ▁
wandb: loss █▆▃▂▃▁▁▁▂▁▁▂▂▂▃▂▁▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▂▂▂▂▂
wandb: precision ▁
wandb: recall ▁
wandb:
wandb: Run summary:
wandb: accuracy 0.00536
wandb: f1 0.00052
wandb: loss 82.67569
wandb: precision 0.00028
wandb: recall 0.00723
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220705_111956-1b3t18jh
wandb: Find logs at: ./wandb/offline-run-20220705_111956-1b3t18jh/logs
INFO: WandB run closed
INFO: eval time = 13.744579076766968 seconds
INFO: Finished eval
INFO: ----------------------------------------------------------------------------------------------------
INFO: SAVING MODEL TO ./models/full_model
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2769, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806079 milliseconds before timing out.
INFO: SAVED MODEL TO ./models/full_model/rffp_encoder_t2v_full_model.pt
----------------------------------------------------------------------------------------------------
INFO: SAVING EXPERIMENTS DF
INFO: fields recorded for experiment: ['list_file_path', 'data_config_path', 'data_preprocess_method', 'n_examples_tot', 'train_size', 'test_size', 'n_train', 'n_test', 'n_labels', 'size_of_dataset', 'model_config_path', 'model_name', 'd_input', 't2v_activation_func', 'embedding_size', 'n_self_attention_heads', 'n_encoder_blocks', 'n_params', 'dropout_p', 'activation', 'loss_fn', 'model_dump', 'trainer_config_path', 'batch_size', 'n_epochs', 'n_training_steps', 'optimizer', 'learning_rate', 'lr_scheduler_type', 'lr_n_warmup_steps', 'train_time', 'test_accuracy', 'test_f1', 'test_recall', 'test_precision', 'output_dir', 'model_save_name']
----------------------------------------------------------------------------------------------------
INFO: YO DONE!!!!!
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2769, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806079 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 3460319) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
run_training.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-07-05_11:56:01
  host      : silver-surfer.airlab.com
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 3460319)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3460319
========================================================
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 528, in launch_command
    multi_gpu_launcher(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 279, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'run_training.py']' returned non-zero exit status 1.
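In case it helps, the call at the end of run_training.py that triggers the save looks roughly like this (simplified sketch; the trainer variable name is a placeholder, only the save path matches the log above):

# Simplified sketch of the call site; `trainer` is a placeholder name,
# the save path is the one from my logs.
trainer.save_model(
    model_to_save=trainer.model,
    model_save_path='./models/full_model/rffp_encoder_t2v_full_model.pt',
)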
I’m not sure how to interpret the error. Does anyone have any advice about what I might’ve messed up? Thanks in advance for your help!