Hi! I’m using accelerate for distributed RL training. I’m using neuroevolution, meaning that at regular intervals my network architecture randomly mutates. The issue is that I’m ending up with different mutations of the same network on my different devices: e.g. we may randomly add a layer to the copy of the model on GPU:0, but randomly remove a node from the copy running on GPU:1. The parameters now differ in shape across devices. This obviously can’t work, because distributed training requires the model to be identical on every device.
So my question is this: is there a way to ‘recombine’ across devices and reset what the model looks like on all GPUs? I don’t think simply mutating the model on the ‘main’ device and then calling accelerator.prepare again would work, since the mutation would only happen in the main process and the other ranks would never see it. I think the best solution would be to somehow come together again across devices, mutate once, and then redistribute the mutated model to all ranks, but I’m not sure how to do this with accelerate.
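To make the idea concrete, here’s the kind of workaround I’ve been sketching: instead of communicating the mutated model, every rank derives the mutation from a shared seed (e.g. based on the training step), so all processes make the identical random choice without any cross-device communication. This is just a toy sketch with stdlib `random`; `pick_mutation` and the mutation list are placeholders I made up, not real accelerate APIs:

```python
import random

MUTATIONS = ["add_layer", "remove_node", "widen_layer"]

def pick_mutation(step: int, base_seed: int = 1234) -> str:
    # Every rank derives the SAME seed from the current training step,
    # so all processes pick the identical mutation independently.
    rng = random.Random(base_seed + step)
    return rng.choice(MUTATIONS)

# Simulating two ranks at the same training step: their choices agree,
# so each rank can apply the same architectural change locally.
choice_rank0 = pick_mutation(step=10)
choice_rank1 = pick_mutation(step=10)
assert choice_rank0 == choice_rank1
```

After every rank has applied the identical mutation locally, I’m assuming I’d re-run accelerator.prepare on the new model, but I don’t know whether that’s sound, or whether it would be better to mutate only on rank 0 and broadcast the result.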
Any suggestions are hugely appreciated! Thanks