What is the right way to save a checkpoint with Accelerate while training on multiple GPUs?

I am training a model on multiple GPUs and I am confused about the right way to save a checkpoint. Which of the following is correct?

  1. Should I check whether the current process is the main process with accelerator.is_main_process and only then call accelerator.save_state()? When I do this, only one random state is stored.

  2. Or, irrespective of the process, should every process just call accelerator.save_state()? When I do this, a random state is saved for all 8 GPUs.

Which is the right way to do this, and what do you recommend?
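For context, here is a minimal sketch of what the two options above look like in code; the checkpoint directory name is just a placeholder, not something from the question:

    from accelerate import Accelerator

    accelerator = Accelerator()

    # Option 1: only the main process calls save_state, so the checkpoint
    # directory ends up with a single random state file.
    if accelerator.is_main_process:
        accelerator.save_state("checkpoints/step_1000")

    # Option 2: every process calls save_state, so the checkpoint contains
    # one random state file per process (8 on an 8-GPU run).
    accelerator.save_state("checkpoints/step_1000")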

I have exactly the same question. I launched training on 3 GPUs and used approach #1, as suggested in Deferring Executions.

Now, from the logs, it seems that each of the 3 processes saves the model, since I see something like this in each of the 3 log files:

Saved val model at epoch 833
Trained model saved to /home/<>/logs/tr_2023-12-24T12-23-10.708064/val_model_state_dict.pth

FYI, I do the logging myself in the following way: I first obtain the process index, pid = self.accelerator.process_index, and then create a log file named something like f'log_{pid}.txt'. So I ended up with 3 log files named log_0.txt, log_1.txt, and log_2.txt, and each of them contains the model-saving message shown above.
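For reference, a minimal sketch of that per-process logging setup (the file names match the ones above; the log message itself is made up):

    import logging

    from accelerate import Accelerator

    accelerator = Accelerator()
    pid = accelerator.process_index  # 0, 1, 2 on a 3-GPU run

    # One log file per process: log_0.txt, log_1.txt, log_2.txt
    logging.basicConfig(filename=f"log_{pid}.txt", level=logging.INFO)
    logging.info("logger ready on process %d", pid)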

FYI, my saving code looks like this:

        if self.accelerator.is_main_process:
            model_dir = join(self.logdir, 'models')
            # p is the output path for the checkpoint (built elsewhere, not shown in this snippet)
            trained_model = self.accelerator.unwrap_model(trained_model)
            to_save = {'model_weights': trained_model.state_dict()}
            torch.save(to_save, p)
            self.log_info('Trained model saved to {}'.format(p))

What does this mean? I guess each of the 3 processes can be the main process, so all 3 of them end up saving the model. Am I correct?

Hi, please have a look at the official docs and this example! This should give you the answer you want.
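For anyone finding this thread later, the saving pattern described in those docs looks roughly like the sketch below. The model here is a stand-in, the file name is a placeholder, and you should double-check the details against the current Accelerate documentation:

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()
    model = torch.nn.Linear(4, 2)      # stand-in for the real model
    model = accelerator.prepare(model)

    # Let every process finish its current work before saving.
    accelerator.wait_for_everyone()

    # Unwrap the model from its distributed wrapper, then save it with
    # accelerator.save, which writes the file once per machine rather
    # than once per process.
    unwrapped_model = accelerator.unwrap_model(model)
    accelerator.save(unwrapped_model.state_dict(), "trained_model.pth")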