I am training a model on multiple GPUs and I am confused about the right way to save a checkpoint.
Should I first check whether the current process is the main process using
accelerate.is_main_process and only then save the state using
accelerate.save_state? When I do this, only one random state is stored.
Or should I just call
accelerate.save_state on every process, irrespective of whether it is the main one? When I do this, it saves a random state for each of the 8 GPUs.
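To make the two options concrete, here is a minimal sketch of what I mean. It assumes the object passed in is my Accelerate `Accelerator` instance (the one I refer to as accelerate above); the function names and the `output_dir` argument are just placeholders for illustration, not part of my actual training script.

```python
def save_on_main_only(accelerator, output_dir):
    # Option 1: guard the call so that only the main process runs save_state.
    # In my runs this leaves a single random state in the checkpoint folder.
    if accelerator.is_main_process:
        accelerator.save_state(output_dir)


def save_on_all_processes(accelerator, output_dir):
    # Option 2: every process calls save_state, with no main-process guard.
    # In my runs this writes one random state per GPU (8 in total).
    accelerator.save_state(output_dir)
```

In both cases the function would be called from every process at the same point in the training loop; the only difference is whether the is_main_process check wraps the save_state call.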
Which is the right way to do this, and what do you recommend?