So if I understand correctly: for launching very large models such as BLOOM 176B for inference, I can do this via Accelerate as long as one node has enough GPUs and GPU memory. When I need to distribute the model across two nodes via some kind of model parallelism, I would be better off writing a custom solution. Any answer would be appreciated so that I can move on and try something different.
Thanks for your response! From the pull request, it seems like you are improving the memory balancing across the GPUs within a single node. This is great already! I am wondering whether this improved device map would also extend to multiple nodes. For example, I have 2 nodes, each with 8 A6000 48GB GPUs. I want the first half of the layers assigned to the first node (= 8 A6000s) and the second half of the layers to the second node (the other 8 A6000s).
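To make the question concrete, here is a minimal sketch of the kind of `device_map`-style dict I have in mind, splitting the layers evenly over 16 logical GPU indices. The module names (`transformer.h.{i}`, `transformer.word_embeddings`, `lm_head`, etc.) follow the BLOOM naming in `transformers`, and the helper itself is hypothetical; as far as I can tell, a `device_map` only refers to devices visible on the local host, so whether indices 8-15 could ever mean "the other node's GPUs" is exactly what I am asking.

```python
def make_device_map(num_layers: int, gpus_per_node: int = 8, num_nodes: int = 2) -> dict:
    """Hypothetical helper: spread transformer blocks evenly over all GPUs,
    keeping layers contiguous so each node holds a contiguous half."""
    total_gpus = gpus_per_node * num_nodes
    device_map = {
        # embeddings on the first GPU, final norm / head on the last GPU
        "transformer.word_embeddings": 0,
        "transformer.word_embeddings_layernorm": 0,
        "transformer.ln_f": total_gpus - 1,
        "lm_head": total_gpus - 1,
    }
    for i in range(num_layers):
        # even, contiguous split: layer i goes to GPU floor(i * total_gpus / num_layers)
        device_map[f"transformer.h.{i}"] = i * total_gpus // num_layers
    return device_map

# BLOOM-176B has 70 transformer blocks; with 16 GPUs, layers 0-34 land on
# GPU indices 0-7 (node 1) and layers 35-69 on indices 8-15 (node 2).
dm = make_device_map(70)
print(dm["transformer.h.34"], dm["transformer.h.35"])
```

Within a single node this dict could be passed as the `device_map` argument when loading the model; the cross-node half is the part I don't see a supported path for.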