I’m trying to fine-tune mosaicml/mpt-7b-instruct on cloud resources (modal.com) with multiple GPUs, but I’m still running out of memory even on the larger GPUs.
I’m using the Hugging Face Trainer for this and confirmed that training_args resolves n_gpu to 2 or 4, depending on how many GPUs I configure Modal to provide.
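For reference, this is roughly how I’m checking the GPU count (simplified; the hyperparameters here are placeholders, not my exact values):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mpt-7b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,
)

# With 2 or 4 GPUs attached in Modal, this prints 2 or 4 respectively.
print("n_gpu:", training_args.n_gpu)
```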
Can someone help me understand whether a separate launcher like torchrun (or maybe accelerate launch…) is required? The information I’ve read seems to conflict.
For example, the official docs show launching with torchrun (using run_clm.py as the example script):
https://huggingface.co/docs/transformers/en/perf_train_gpu_many
This Stack Overflow question seems to agree: “How to use Huggingface Trainer with multiple GPUs?” (Stack Overflow)
However, these forum posts imply that the Trainer should work “out of the box” as long as n_gpu is set correctly: https://discuss.huggingface.co/t/finetuning-gpt2-using-multiple-gpu-and-trainer/796/10
And this one describes someone being surprised that their process was forked: How to restrict training to one GPU if multiple are available, co - #3 by dropout05
As a separate question (maybe warranting its own thread later): what should the torchrun training script actually do? Just load the model, preprocess the dataset, and call train()? Are there no special parameters needed to tell it to run across multiple GPUs or machines?
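In other words, is it really just something like the sketch below, launched with torchrun? (This is only my understanding from the docs; the dataset, script name, and hyperparameters are placeholders, and I haven’t verified it works.)

```python
# My assumption from the docs: launched with
#   torchrun --nproc_per_node=2 train.py
# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE env vars, and as far as I
# understand, Trainer/TrainingArguments picks those up, so there is no
# explicit DDP code in the script itself.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mosaicml/mpt-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # MPT's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Tiny placeholder dataset just to illustrate the structure.
raw = Dataset.from_dict({"text": ["Hello world"] * 8})
tokenized_dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./mpt-7b-finetuned",
    per_device_train_batch_size=1,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()
```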
Is there any way torchrun can be encapsulated in a single training call, with parameters to split the load across all GPUs?
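The closest thing I’ve found to that is Accelerate’s notebook_launcher, which spawns the worker processes from inside a single Python entrypoint. Rough sketch of how I think it would be used (the training_function body is a placeholder, and I haven’t verified this works on Modal):

```python
from accelerate import notebook_launcher

def training_function():
    # Build the model, dataset, TrainingArguments, and Trainer here,
    # then call trainer.train(). Each spawned process runs this
    # function on its own GPU.
    ...

# Spawns one process per GPU from a single Python call (assuming 2 GPUs here).
notebook_launcher(training_function, num_processes=2)
```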
Note: since I’m running this on a cloud service (modal.com), I’m assuming training isn’t actually using multiple GPUs, but as far as I know I can’t easily see the per-GPU load to confirm.
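The best I’ve come up with for checking is printing memory usage from inside the container during training, something like this (a rough snapshot only, so I’m not sure how representative it is):

```python
import torch

# Quick snapshot of per-GPU memory usage from inside the container.
for i in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(i) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: allocated={allocated_gb:.1f} GiB, reserved={reserved_gb:.1f} GiB")
```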