How do I use ddp_backend="gloo" with the Trainer to run distributed training across multiple Docker containers on a single node?
- Do I still need to call dist.init_process_group inside the training script?
- What should the world size be when running multiple containers on a single node?
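For context, here is a minimal sketch of the per-container setup I have in mind, using plain torch.distributed with the gloo backend. The addresses, port, and rank/world-size values are placeholders: my assumption is that each container exports its own unique RANK, that WORLD_SIZE is the total number of processes across all containers, and that MASTER_ADDR/MASTER_PORT point at one container reachable from the others.

```python
import os
import torch.distributed as dist

# Hypothetical per-container environment (placeholder values):
# each container sets a unique RANK in 0..WORLD_SIZE-1, and all
# containers agree on MASTER_ADDR, MASTER_PORT, and WORLD_SIZE.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")  # total processes across all containers

# Gloo rendezvous via the env:// method reads the variables above.
dist.init_process_group(backend="gloo")
world_size = dist.get_world_size()
print(world_size)
dist.destroy_process_group()
```

If this is right, then with two containers running one process each, WORLD_SIZE would be 2 and the containers would use RANK=0 and RANK=1; I am unsure whether the Trainer performs this init_process_group call itself when launched via torchrun, which is part of what I am asking.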