Questions on tensor parallelism using PyTorch

Consider the following case: 8 GPUs with ranks 0, 1, 2, 3, 4, 5, 6, 7. Assume we implement tensor parallelism and data parallelism according to the following scheme: tensor group 0 contains ranks 0, 1, 2, 3; tensor group 1 contains ranks 4, 5, 6, 7; the data-parallel groups are [0, 4], [1, 5], [2, 6], [3, 7] (a sketch of how these groups could be created is included after the questions). QUESTIONS:

  1. Given this scheme, what will the communication pattern look like? Describe it from the perspective of each GPU in tensor group 1.
  2. What is the difference between the above scheme and a single tensor group containing all 8 GPUs?
  3. Consider rank #1 and rank #5: each should handle a distinct and unique portion of the dataset. What fraction of the dataset is that portion? 1/2? 1/4? 1/8?
  4. At the end of the forward pass, will rank #5 all-reduce with rank #1 only?
  5. What happens during the backward pass?
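
For reference, here is a minimal sketch (untested) of how the tensor-parallel and data-parallel groups described above could be created with `torch.distributed.new_group`. It assumes an NCCL backend and a launcher such as `torchrun --nproc_per_node=8` that sets the usual rendezvous environment variables; the function and variable names are my own.

```python
# Sketch only: 8 GPUs, one process per GPU, launched e.g. with
# torchrun --nproc_per_node=8, which sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT.
import torch
import torch.distributed as dist

def setup_groups():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Tensor-parallel groups: {0,1,2,3} and {4,5,6,7}.
    tp_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
    # Data-parallel groups: one rank from each tensor group, same local position.
    dp_groups = [[0, 4], [1, 5], [2, 6], [3, 7]]

    tp_group = dp_group = None
    # new_group() is collective: every rank must call it for every group,
    # in the same order, even for groups it does not belong to.
    for ranks in tp_groups:
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            tp_group = g
    for ranks in dp_groups:
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            dp_group = g
    return tp_group, dp_group
```

Recent PyTorch releases also provide `torch.distributed.device_mesh.init_device_mesh`, which, if available in your version, can express the same layout as a 2 x 4 mesh with named dimensions such as ("dp", "tp").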