Distributed training of large models on cloud resources

I’m trying to fine-tune mosaicml/mpt-7b-instruct using cloud resources (modal.com) with multiple GPUs, but I’m still running out of memory even on the large GPUs.

I’m using the Hugging Face Trainer for this and confirmed that training_args resolves n_gpu to 2 or 4, depending on how many GPUs I configure Modal to provide.

Can someone help me understand whether a separate launcher like torchrun (or maybe accelerate launch…) is required? Some of the information I’ve read seems to conflict.

For example, the official docs show launching with torchrun (with run_clm as an example script):
https://huggingface.co/docs/transformers/en/perf_train_gpu_many

This Stack Overflow question seems to agree: machine learning - How to use Huggingface Trainer with multiple GPUs? - Stack Overflow


However, these forum posts imply that the Trainer should work “out of the box” if n_gpu is set properly: https://discuss.huggingface.co/t/finetuning-gpt2-using-multiple-gpu-and-trainer/796/10

This post mentions someone being surprised that their process was forked: How to restrict training to one GPU if multiple are available, co - #3 by dropout05


As a separate question (maybe warranting its own thread later): what should the torchrun training script actually do? Load the model, preprocess the dataset, and call train()? Are there no special parameters needed to tell it to run on multiple machines?
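For reference, here’s roughly what my current script looks like, in case that helps frame the question. The dataset, hyperparameters, and block size are just placeholders, not my real config; the idea is that the same script is launched with something like `torchrun --nproc_per_node=2 train.py` and the Trainer picks up the distributed environment on its own.

```python
# Minimal sketch of a Trainer-based fine-tuning script meant to be launched
# with `torchrun --nproc_per_node=<num_gpus> train.py`.
# Dataset, max_length, and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mosaicml/mpt-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # MPT's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Placeholder dataset; tokenize into fixed-length sequences
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="mpt-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
# Under torchrun, each process sees LOCAL_RANK/WORLD_SIZE and the Trainer
# wraps the model in DDP automatically; the script itself has no special flags.
trainer.train()
```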


Is there any way torchrun can be encapsulated in a single training call, with parameters to split the load across all GPUs?
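The closest thing I’ve found so far is Accelerate’s notebook_launcher, which spawns the worker processes from inside a single Python call instead of requiring a separate torchrun command. A sketch of what I mean (the training_loop body and num_processes=2 are placeholders):

```python
# Sketch: launching a multi-GPU run from a single Python call via Accelerate.
from accelerate import Accelerator, notebook_launcher

def training_loop():
    accelerator = Accelerator()
    # Each spawned process runs this function on its own GPU; real code would
    # build the model, dataset, and Trainer here exactly as in a torchrun script.
    print(f"process {accelerator.process_index} of {accelerator.num_processes}")

# Spawn one process per GPU (2 here is an assumption) from a single call.
notebook_launcher(training_loop, num_processes=2)
```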

Note: I’m running this on a cloud service (modal.com), so I’m assuming it isn’t actually using multiple GPUs, but as far as I know I can’t really see the GPU load to confirm.

Perhaps the seemingly conflicting information comes down to the definition of “distributed” (distributed across multiple nodes/machines vs. distributed across multiple GPUs on one machine).

Have you configured something like DeepSpeed or FSDP? Native DDP (what it’s doing right now) still loads the entire model on each GPU. You want a sharding mechanism that will split the model up across the GPUs during training.
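As a rough sketch of what I mean, FSDP can be switched on through TrainingArguments. The exact flag names vary a bit between transformers versions, and MPTBlock as the wrap target is my assumption for mpt-7b, so double-check against your install:

```python
# Sketch: enabling FSDP sharding via TrainingArguments so parameters,
# gradients, and optimizer state are split across GPUs instead of fully
# replicated as in plain DDP.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mpt-finetune-fsdp",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    fsdp="full_shard auto_wrap",  # shard params/grads/optimizer state
    fsdp_config={
        # Assumption: MPTBlock is the transformer block class in mpt-7b's
        # remote code; check the model source if auto-wrapping complains.
        "transformer_layer_cls_to_wrap": ["MPTBlock"],
    },
)
# Pass `args` to Trainer as usual; you still launch with torchrun or
# `accelerate launch` so each GPU gets its own process.
```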

Thanks, I’ll look into those now. I hadn’t configured either of them.

Have you tried accelerate?

I got it working from a tutorial on modal.com that uses accelerate, working backwards from the GitHub repo mentioned here: Fine-tune an LLM in minutes (ft. Llama 2, CodeLlama, Mistral, etc.) | Modal Docs

mosaicml/mpt-7b had some hiccups, since the default configs use peft and the model’s code was missing a function that peft needed, but I was able to do some fine-tuning there.
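For reference, the PEFT piece boils down to something like the following. The target_modules entry is my assumption for MPT’s fused attention projection, and the LoRA hyperparameters are placeholders:

```python
# Sketch: wrapping the base model with a LoRA adapter via peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct", trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["Wqkv"],  # assumption: MPT fuses q/k/v into one Wqkv Linear
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```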

The tutorial uses accelerate, but it also has some deprecated code using torch.distributed, which I’m trying to get working as well. I’m not sure of the pros and cons of one vs. the other yet (torch.distributed vs. accelerate).

They are essentially synonymous; accelerate just wraps around torch.distributed :slight_smile: (for running on GPUs)
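A rough sketch of what that wrapping looks like in practice (toy model and data, just to show the moving parts):

```python
# Sketch: the same loop runs on 1 or N GPUs; Accelerator handles the
# process-group setup, device placement, and gradient sync that you would
# otherwise do by hand with torch.distributed / DDP.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(128, 2)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(
        torch.randn(256, 128), torch.randint(0, 2, (256,))
    ),
    batch_size=16,
)

# prepare() wraps the model in DDP and shards the dataloader across processes
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # instead of loss.backward()
    optimizer.step()
```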