Model *and* data parallelism when training on multiple GPUs?

Hi there!

I want to train a model from scratch using a dataset of ~35M online comments. Given the computational costs, I hope to set up the training as efficiently as possible, including parallelization.

Following this guide, I see that I can use both model parallelism and data parallelism. My understanding is that if I run training on, say, 4 GPUs, I'd probably want BOTH: the data split and processed in parallel across GPUs, and the model itself trained in parallel, correct? However, the guide discusses them as alternative options.
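For reference, this is roughly how I read the two setups in the guide, reduced to just the distribution argument (smp_options and mpi_options are dicts of library settings defined elsewhere):

# Option 1 as I read the guide: data parallelism only
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# Option 2 as I read the guide: model parallelism, launched via MPI
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}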

I'm planning to combine the two options described in the guide using something like:

distribution = {
    "smdistributed": {
        "modelparallel": smp_options,
        "dataparallel": {"enabled": True},
    },
    "mpi": mpi_options,
}
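In case it matters, the smp_options and mpi_options I'm referring to are just placeholders along the lines of the guide's examples; the actual values are guesses on my part and not tuned at all:

# Placeholder model-parallel settings (values are my guesses, not tuned)
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,    # how many pieces to split the model into
        "microbatches": 4,
    },
}

# Placeholder MPI settings for launching the model-parallel job
mpi_options = {
    "enabled": True,
    "processes_per_host": 4,  # assuming a single 4-GPU instance
}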

However, before I burn a large amount of money, it would be great to get some feedback on whether this makes sense. Any general intuition for how to think about this would also be greatly appreciated.
