Hi! I’m using accelerate for distributed RL training. I’m using neuroevolution, meaning that at regular intervals my network architecture randomly mutates. The issue is that I’m ending up with different mutations of the same network on my different devices: e.g. we may randomly add a layer to the copy of the model on GPU:0, but randomly remove a node from the copy running on GPU:1. The parameters now differ in shape across devices. This obviously can’t work, because distributed training requires the model to be identical on every device.
So my question is this: is there a way to ‘recombine’ across devices and reset what the model looks like on all GPUs? I don’t think simply mutating the model on the ‘main’ device and then calling accelerator.prepare again would work, since the mutation would only happen in the main process and the other ranks would never see it. I think the best solution would be to somehow come together again across devices, mutate once, and then redistribute the mutated model to all ranks, but I’m not sure how to do this with accelerate.
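To make the idea concrete, here’s the kind of workaround I’ve been sketching: instead of communicating the mutated model, every rank derives the mutation from a shared seed (e.g. based on the training step), so all processes make the identical random choice without any cross-device communication. This is just a toy sketch with stdlib `random`; `pick_mutation` and the mutation list are placeholders I made up, not real accelerate APIs:

```python
import random

MUTATIONS = ["add_layer", "remove_node", "widen_layer"]

def pick_mutation(step: int, base_seed: int = 1234) -> str:
    # Every rank derives the SAME seed from the current training step,
    # so all processes pick the identical mutation independently.
    rng = random.Random(base_seed + step)
    return rng.choice(MUTATIONS)

# Simulating two ranks at the same training step: their choices agree,
# so each rank can apply the same architectural change locally.
choice_rank0 = pick_mutation(step=10)
choice_rank1 = pick_mutation(step=10)
assert choice_rank0 == choice_rank1
```

After every rank has applied the identical mutation locally, I’m assuming I’d re-run accelerator.prepare on the new model, but I don’t know whether that’s sound, or whether it would be better to mutate only on rank 0 and broadcast the result.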
Any suggestions are hugely appreciated! Thanks