I’m trying to fine-tune mosaicml/mpt-7b-instruct on cloud resources (modal.com) with multiple GPUs, but I’m still running out of memory even on the larger GPUs.
I’m using the Hugging Face Trainer for this and confirmed that training_args resolves n_gpu to 2 or 4, depending on how many GPUs I configure Modal to provide.
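For reference, this is roughly how I’m checking the GPU count (simplified; the hyperparameters here are placeholders, not my exact values):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mpt-7b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,
)

# With 2 or 4 GPUs attached in Modal, this prints 2 or 4 respectively.
print("n_gpu:", training_args.n_gpu)
```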
Can someone help me understand whether a separate launcher like torchrun (or maybe accelerate launch…) is required? The information I’ve read seems to conflict.
For example, the official docs show launching with torchrun (using run_clm.py as the example script):
https://huggingface.co/docs/transformers/en/perf_train_gpu_many
This Stack Overflow question seems to agree: “How to use Huggingface Trainer with multiple GPUs?” (Stack Overflow)
However, these forum posts imply that the Trainer should work “out of the box” as long as n_gpu is set correctly: https://discuss.huggingface.co/t/finetuning-gpt2-using-multiple-gpu-and-trainer/796/10
And this one describes someone being surprised that their process was forked: How to restrict training to one GPU if multiple are available, co - #3 by dropout05
As a separate question (maybe warranting its own thread later): what should the torchrun training script actually do? Just load the model, preprocess the dataset, and call train()? Are there no special parameters needed to tell it to run across multiple GPUs or machines?
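In other words, is it really just something like the sketch below, launched with torchrun? (This is only my understanding from the docs; the dataset, script name, and hyperparameters are placeholders, and I haven’t verified it works.)

```python
# My assumption from the docs: launched with
#   torchrun --nproc_per_node=2 train.py
# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE env vars, and as far as I
# understand, Trainer/TrainingArguments picks those up, so there is no
# explicit DDP code in the script itself.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mosaicml/mpt-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # MPT's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Tiny placeholder dataset just to illustrate the structure.
raw = Dataset.from_dict({"text": ["Hello world"] * 8})
tokenized_dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./mpt-7b-finetuned",
    per_device_train_batch_size=1,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()
```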
Is there any way torchrun can be encapsulated in a single training call, with parameters to split the load across all GPUs?
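The closest thing I’ve found to that is Accelerate’s notebook_launcher, which spawns the worker processes from inside a single Python entrypoint. Rough sketch of how I think it would be used (the training_function body is a placeholder, and I haven’t verified this works on Modal):

```python
from accelerate import notebook_launcher

def training_function():
    # Build the model, dataset, TrainingArguments, and Trainer here,
    # then call trainer.train(). Each spawned process runs this
    # function on its own GPU.
    ...

# Spawns one process per GPU from a single Python call (assuming 2 GPUs here).
notebook_launcher(training_function, num_processes=2)
```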
Note: since I’m running this on a cloud service (modal.com), I’m assuming training isn’t actually using multiple GPUs, but as far as I know I can’t easily see the per-GPU load to confirm.
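The best I’ve come up with for checking is printing memory usage from inside the container during training, something like this (a rough snapshot only, so I’m not sure how representative it is):

```python
import torch

# Quick snapshot of per-GPU memory usage from inside the container.
for i in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(i) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: allocated={allocated_gb:.1f} GiB, reserved={reserved_gb:.1f} GiB")
```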