Model *and* data parallelism when training on multiple GPUs?

Hi there!

I want to train a model from scratch using a dataset of ~35M online comments. Given the computational costs, I hope to set up the training as efficiently as possible, including parallelization.

Following this guide, I see that I can use both model parallelism and data parallelism. My understanding is that if I run training on, say, 4 GPUs, I'd probably want BOTH: the data split and processed in parallel across GPUs, and the model itself trained in parallel, correct? However, the guide discusses them as alternative options.
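For reference, this is roughly how I read the two setups in the guide, reduced to just the distribution argument (smp_options and mpi_options are dicts of library settings defined elsewhere):

# Option 1 as I read the guide: data parallelism only
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# Option 2 as I read the guide: model parallelism, launched via MPI
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}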

I'm planning to combine the two options described in the guide using something like:

distribution = {
    "smdistributed": {
        "modelparallel": smp_options,
        "dataparallel": {"enabled": True},
    },
    "mpi": mpi_options,
}
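In case it matters, the smp_options and mpi_options I'm referring to are just placeholders along the lines of the guide's examples; the actual values are guesses on my part and not tuned at all:

# Placeholder model-parallel settings (values are my guesses, not tuned)
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,    # how many pieces to split the model into
        "microbatches": 4,
    },
}

# Placeholder MPI settings for launching the model-parallel job
mpi_options = {
    "enabled": True,
    "processes_per_host": 4,  # assuming a single 4-GPU instance
}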

However, before I burn a large amount of money, it would be great to get some feedback on whether this makes sense. Any general intuition for how to think about this would also be greatly appreciated.
