How do I use ddp_backend="gloo" with the Trainer to run distributed training across multiple Docker containers on a single node?
- Do I still need to call dist.init_process_group inside the training script?
- What should the world size be when running multiple containers on a single node?
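For context, here is a minimal sketch of the per-container setup I have in mind, using plain torch.distributed with the gloo backend. The addresses, port, and rank/world-size values are placeholders: my assumption is that each container exports its own unique RANK, that WORLD_SIZE is the total number of processes across all containers, and that MASTER_ADDR/MASTER_PORT point at one container reachable from the others.

```python
import os
import torch.distributed as dist

# Hypothetical per-container environment (placeholder values):
# each container sets a unique RANK in 0..WORLD_SIZE-1, and all
# containers agree on MASTER_ADDR, MASTER_PORT, and WORLD_SIZE.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")  # total processes across all containers

# Gloo rendezvous via the env:// method reads the variables above.
dist.init_process_group(backend="gloo")
world_size = dist.get_world_size()
print(world_size)
dist.destroy_process_group()
```

If this is right, then with two containers running one process each, WORLD_SIZE would be 2 and the containers would use RANK=0 and RANK=1; I am unsure whether the Trainer performs this init_process_group call itself when launched via torchrun, which is part of what I am asking.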