Save custom objects in the state for each process

As of now, custom objects registered for checkpointing are not saved per process, as seen here. Is there a way to save and load a separate checkpoint for each registered custom object, per process/rank, in a distributed setting?

Alternatively, do you suggest that we should gather everything before saving and then distribute during loading?

I’m not fully understanding how it’s not supported by just saving what you want to save outside of Accelerate right now (so each process will save its own version). Could you tell us more about your use case?

Thanks for the answer!

I’m not fully understanding how it’s not supported by just saving what you want to save outside of Accelerate right now (so each process will save its own version).

Yes, I think it should work. We did not take this approach; instead, we chose to register the object with Accelerate via register_for_checkpointing.

Could you tell us more about your use case?

Sure. In our case, we have a custom object that tracks the dataloader's progress for each rank and each worker ID, so that we can resume training from where it previously stopped.

Ok, so in this case it does seem easier to not register the object for checkpointing and to save/load it manually, for example by including the process index in the file name so you know which saved file to pick when reloading.
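
For anyone landing here later, a minimal sketch of that manual approach might look like the following. It is not the original poster's code: the DataloaderProgress class, the helper functions, and the file naming scheme are all hypothetical, and it uses only the standard library so that it runs on its own. It assumes the tracked state is JSON-serializable.

```python
import json
from pathlib import Path


class DataloaderProgress:
    """Hypothetical tracker: samples consumed per dataloader worker on one rank."""

    def __init__(self):
        self.samples_seen = {}  # worker id -> number of samples consumed

    def update(self, worker_id, n=1):
        self.samples_seen[worker_id] = self.samples_seen.get(worker_id, 0) + n

    def state_dict(self):
        return {"samples_seen": self.samples_seen}

    def load_state_dict(self, state):
        # JSON turns integer keys into strings, so convert them back.
        self.samples_seen = {int(k): v for k, v in state["samples_seen"].items()}


def progress_path(checkpoint_dir, process_index):
    # Hypothetical naming scheme: one file per rank, keyed by process index.
    return Path(checkpoint_dir) / f"dataloader_progress_rank{process_index}.json"


def save_progress(progress, checkpoint_dir, process_index):
    path = progress_path(checkpoint_dir, process_index)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(progress.state_dict()))


def load_progress(progress, checkpoint_dir, process_index):
    path = progress_path(checkpoint_dir, process_index)
    progress.load_state_dict(json.loads(path.read_text()))
```

In a training script, each process would presumably call save_progress(progress, checkpoint_dir, accelerator.process_index) next to accelerator.save_state(...), and load_progress(...) with the same index when resuming. The calls should run on every rank, not only on the main process, since each rank writes its own file.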

Thanks a lot for your feedback!