Hi there!
I want to train a model from scratch using a dataset of ~35M online comments. Given the computational costs, I hope to set up the training as efficiently as possible, including parallelization.
Following this guide, I see that I can use both model parallelism and data parallelism. My understanding is that if I run training on, say, 4 different GPUs, I'd probably want BOTH: data parallelism (each device processing a different shard of each batch) and model parallelism (the model itself split across devices), correct? However, the guide discusses them as alternative options.
I will try to combine the options described in the guide using:

distribution = {
    "smdistributed": {
        "modelparallel": smp_options,
        "dataparallel": {"enabled": True},
    },
    "mpi": mpi_options,
}
However, before I burn a large amount of money, it would be great to get some feedback on whether this makes sense. Any general intuition for how to think about this would also be greatly appreciated.