What is the right way to save a checkpoint with Accelerate while training on multiple GPUs?

I am training a model on multiple GPUs and I am confused about the right way to save a checkpoint. Which of the following is correct?

  1. Should I check whether the current process is the main process with accelerator.is_main_process and only then call accelerator.save_state()? When I do this, only one random state is stored.

  2. Or, irrespective of the process, should every process just call accelerator.save_state()? When I do this, a random state is saved for all 8 GPUs.

Which is the right way to do this, and what do you recommend?
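For context, here is a minimal sketch of what the two options above look like in code; the checkpoint directory name is just a placeholder, not something from the question:

    from accelerate import Accelerator

    accelerator = Accelerator()

    # Option 1: only the main process calls save_state, so the checkpoint
    # directory ends up with a single random state file.
    if accelerator.is_main_process:
        accelerator.save_state("checkpoints/step_1000")

    # Option 2: every process calls save_state, so the checkpoint contains
    # one random state file per process (8 on an 8-GPU run).
    accelerator.save_state("checkpoints/step_1000")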

I have exactly the same question. I launched training on 3 GPUs and used approach #1, as suggested in Deferring Executions.

Now, from the logs, it seems that each of the 3 processes saves the model, since I see something like this in each of the 3 log files:

Saved val model at epoch 833
Trained model saved to /home/<>/logs/tr_2023-12-24T12-23-10.708064/val_model_state_dict.pth

FYI, I do the logging myself in the following way: I first obtain the process index, pid = self.accelerator.process_index, and then create a log file named something like f'log_{pid}.txt'. So I ended up with 3 log files named log_0.txt, log_1.txt, and log_2.txt, and each of them contains the model-saving message shown above.
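For reference, a minimal sketch of that per-process logging setup (the file names match the ones above; the log message itself is made up):

    import logging

    from accelerate import Accelerator

    accelerator = Accelerator()
    pid = accelerator.process_index  # 0, 1, 2 on a 3-GPU run

    # One log file per process: log_0.txt, log_1.txt, log_2.txt
    logging.basicConfig(filename=f"log_{pid}.txt", level=logging.INFO)
    logging.info("logger ready on process %d", pid)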

FYI, my saving code looks like this:

        if self.accelerator.is_main_process:
            model_dir = join(self.logdir, 'models')
            # p is the output path for the checkpoint (built elsewhere, not shown in this snippet)
            trained_model = self.accelerator.unwrap_model(trained_model)
            to_save = {'model_weights': trained_model.state_dict()}
            torch.save(to_save, p)
            self.log_info('Trained model saved to {}'.format(p))

What does this mean? I guess each of the 3 processes can be the main process, so all 3 of them end up saving the model. Am I correct?

Hi, please have a look at the official docs and this example! This should give you the answer you want.
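For anyone finding this thread later, the saving pattern described in those docs looks roughly like the sketch below. The model here is a stand-in, the file name is a placeholder, and you should double-check the details against the current Accelerate documentation:

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()
    model = torch.nn.Linear(4, 2)      # stand-in for the real model
    model = accelerator.prepare(model)

    # Let every process finish its current work before saving.
    accelerator.wait_for_everyone()

    # Unwrap the model from its distributed wrapper, then save it with
    # accelerator.save, which writes the file once per machine rather
    # than once per process.
    unwrapped_model = accelerator.unwrap_model(model)
    accelerator.save(unwrapped_model.state_dict(), "trained_model.pth")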