I am training a model on multiple GPUs. What is the right way to save a checkpoint? I am currently confused about how this works.
- Should I check whether the current process is the main process using accelerate.is_main_process and only then save the state using accelerate.save_state? When I do this, only one random state is stored.
- Or should I just call accelerate.save_state on every process, irrespective of which one it is? When I do this, a random state is saved for each of the 8 GPUs.
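To make the two options concrete, here is a minimal sketch of what I mean. It assumes `accelerator` is an `accelerate.Accelerator` instance (the object I referred to as accelerate above) that has already prepared the model and optimizer. The function names and the `ckpt_dir` argument are hypothetical and only there for illustration.

```python
def save_on_main_only(accelerator, ckpt_dir):
    """Option 1: guard the call so only the main process saves."""
    # Assumption: `accelerator` is an accelerate.Accelerator instance.
    if accelerator.is_main_process:
        accelerator.save_state(ckpt_dir)


def save_on_all_processes(accelerator, ckpt_dir):
    """Option 2: every process calls save_state, with no guard."""
    accelerator.save_state(ckpt_dir)
```

In my runs, the first variant leaves one random state in the checkpoint directory, while the second leaves one per GPU (8 in total).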
Which is the right way to do this, and what do you recommend?