Multiple GPU training

Can I ask whether it's possible to do multi-GPU training when the whole model doesn't fit on a single GPU once loaded? For example, I'm training Llama 3.1 8B in full precision with the Hugging Face Trainer on 4 GPUs with 16 GB of VRAM each. The model takes up about 32 GB when loaded (8B parameters × 4 bytes in fp32), so each GPU would hold about 8 GB of it (4 × 8 GB = 32 GB).
When I run the training, the number of steps equals (dataset length) × (number of epochs) / (batch size). If the training were distributed, that would additionally be divided by the number of GPUs.
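For instance, with made-up numbers: 10,000 examples × 2 epochs / batch size of 4 = 5,000 steps on a single GPU, and under data parallelism on my 4 GPUs I'd expect 5,000 / 4 = 1,250 steps.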
So is this even possible with this hardware, and if so, is there a way to set it up with the Hugging Face Trainer? I'm running the code in a Jupyter notebook, and even when I load a model that does fit on a single GPU, the training never starts in distributed mode.

You can take a look at [Efficient Training on Multiple GPUs](https://huggingface.co/docs/transformers/perf_train_gpu_many).
If I understood your setup correctly, you are looking at Case 2: Your model doesn't fit onto a single GPU. For that case the docs point to ZeRO (via DeepSpeed), pipeline parallelism, or tensor parallelism.
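One concrete option from that page is DeepSpeed ZeRO stage 3, which shards parameters, gradients, and optimizer states across your 4 GPUs (optionally offloading to CPU RAM) so no single card ever has to hold the full 32 GB. A minimal sketch, assuming `transformers` and `deepspeed` are installed; the checkpoint id, `my_dataset`, and the hyperparameters are placeholders, not a tested recipe:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# ZeRO stage 3: shard params, grads, and optimizer states across all
# ranks; CPU offload pushes them to system RAM when not in use, which
# full-precision training on 16 GB cards will almost certainly need.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

model_id = "meta-llama/Llama-3.1-8B"  # adjust to the exact checkpoint you use
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=my_dataset,  # placeholder: your tokenized dataset
)
trainer.train()
```

The catch for your notebook question: the Trainer only runs distributed when the script is started by a multi-process launcher, e.g. `torchrun --nproc_per_node 4 train.py` or `accelerate launch train.py`. Executed as a plain notebook cell, it sees one process and one GPU, which is why your runs never start distributed.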
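If you want to stay inside Jupyter, `accelerate` ships a `notebook_launcher` that spawns the worker processes for you. Note this covers plain DDP, which replicates the model on each card, so it addresses your second observation (a model that does fit never training distributed) rather than the 8B-in-fp32 case; the DeepSpeed path is usually launched from the command line. A sketch, where `train_fn` is a hypothetical function wrapping the Trainer setup above:

```python
from accelerate import notebook_launcher

def train_fn():
    # hypothetical wrapper: build the model, TrainingArguments, and
    # Trainer inside this function, then call trainer.train().
    # Each spawned process runs it independently, and CUDA must not
    # have been touched in the notebook before the launcher is called.
    ...

notebook_launcher(train_fn, num_processes=4)  # one process per GPU
```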