I am training a model on multiple GPUs. What is the right way to save a checkpoint? I am currently confused about how this works.
Should I check whether the current process is the main process using accelerator.is_main_process and, if so, save the state with accelerator.save_state? When I do this, only one random state is stored.
Or should I call accelerator.save_state from every process regardless? When I do this, a random state is saved for each of the 8 GPUs.
Which is the right way to do this, and what do you recommend?
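To make the two options concrete, here is a minimal sketch contrasting them. It is a simulation only (a FakeAccelerator class stands in for accelerate.Accelerator, since the real call needs a multi-GPU launch), but the control flow and the per-process random_states file naming match what Accelerate does:

```python
# Simulated sketch of the two checkpointing patterns; FakeAccelerator is a
# stand-in for accelerate.Accelerator so the control flow is visible without GPUs.

class FakeAccelerator:
    def __init__(self, process_index, num_processes):
        self.process_index = process_index
        self.num_processes = num_processes

    @property
    def is_main_process(self):
        # In real Accelerate, only rank 0 is the main process.
        return self.process_index == 0

    def save_state(self, output_dir, saved):
        # Real accelerate.save_state writes a random_states_{process_index}.pkl
        # per calling process; here we just record the filename it would create.
        saved.append(f"{output_dir}/random_states_{self.process_index}.pkl")


def approach_1(accelerators, saved):
    # Approach #1: guarded -- only the main process calls save_state,
    # so only one RNG state file is written.
    for acc in accelerators:
        if acc.is_main_process:
            acc.save_state("ckpt", saved)


def approach_2(accelerators, saved):
    # Approach #2: unguarded -- every rank calls save_state,
    # so one RNG state file is written per GPU.
    for acc in accelerators:
        acc.save_state("ckpt", saved)


world = [FakeAccelerator(i, 3) for i in range(3)]
a1, a2 = [], []
approach_1(world, a1)
approach_2(world, a2)
print(len(a1), len(a2))  # 1 3
```

This is exactly the asymmetry described above: the guard reduces the saved RNG states to one, while the unguarded call captures one per process.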
I have exactly the same question. I launched training on 3 GPUs and used approach #1, as suggested by Deferring Executions.
Now, from the logs, it seems that each of the 3 processes saves the model, since I see something like this in each of the 3 log files:
Saved val model at epoch 833
Trained model saved to /home/<>/logs/tr_2023-12-24T12-23-10.708064/val_model_state_dict.pth
FYI, I do the logging myself as follows: I first obtain the process index, pid = self.accelerator.process_index, and then create a log file named f'log_{pid}.txt'. So I ended up with 3 log files:
log_0.txt, log_1.txt, and log_2.txt. In each of them, I can see the model-saving message above.
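For reference, the per-process logging setup described above can be sketched roughly like this (the function name and logdir argument are assumptions, not the poster's actual code):

```python
import os

def open_process_log(accelerator, logdir):
    # One log file per process, keyed by accelerator.process_index,
    # e.g. log_0.txt, log_1.txt, log_2.txt on a 3-GPU run.
    pid = accelerator.process_index
    os.makedirs(logdir, exist_ok=True)
    return open(os.path.join(logdir, f"log_{pid}.txt"), "a")
```

Because every rank gets its own file, a message appearing in all three logs means all three processes executed the line that prints it.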
FYI, my saving code looks like this:
if self.accelerator.is_main_process:
    model_dir = join(self.logdir, 'models')
    trained_model = self.accelerator.unwrap_model(trained_model)
    to_save = {'model_weights': trained_model.state_dict()}
    p = join(model_dir, 'val_model_state_dict.pth')  # checkpoint path
    torch.save(to_save, p)
    self.log_info('Trained model saved to {}'.format(p))
What does this mean? Can each of the 3 processes be the main process, so that all 3 of them end up saving the model? Am I understanding this correctly?
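For comparison, the pattern the Accelerate docs describe for saving model weights is to synchronize first and then let only the main process write. A hedged sketch (simulated again with a fake Accelerator, since the real barrier needs a distributed launch; in real Accelerate only rank 0 is the main process):

```python
# Simulated sketch of the docs-style guarded save: barrier, then only rank 0 writes.

class FakeAccelerator:
    def __init__(self, process_index):
        self.process_index = process_index

    @property
    def is_main_process(self):
        # In real Accelerate, exactly one rank (rank 0) is the main process.
        return self.process_index == 0

    def wait_for_everyone(self):
        pass  # real accelerator.wait_for_everyone() blocks until all ranks arrive


def save_model_weights(acc, saved_paths):
    acc.wait_for_everyone()      # make sure no rank is still mid-step
    if acc.is_main_process:      # exactly one rank writes the checkpoint
        saved_paths.append("models/val_model_state_dict.pth")


saved = []
for rank in range(3):
    save_model_weights(FakeAccelerator(rank), saved)
print(len(saved))  # 1
```

If the real run writes the message from all three ranks despite an is_main_process guard, that suggests each launched process sees itself as rank 0 (e.g. three separate single-process runs rather than one 3-process group), which is worth checking in how the script is launched.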