Multi GPU training - Model parallelism

TLDR: Hi, I am trying to train a PEFT (LoRA/p-tuning) model on top of Falcon 40B using 3 A100s. I am trying to implement model parallelism because the bf16/fp16 model won't fit on a single GPU. Is there a way to do this?
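For context, this is roughly how I set up the PEFT model (a minimal sketch; the LoRA hyperparameters and the "query_key_value" target module name are my assumptions, not anything I'm sure is right for Falcon):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # bf16 weights alone won't fit on one GPU here
    trust_remote_code=True,
)

# LoRA config -- "query_key_value" is what I believe the Falcon attention
# projection is called, but this is an assumption; check model.named_modules().
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```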

I have implemented a Trainer-based script. According to the DeepSpeed integration documentation, calling the script with the deepspeed launcher and adding --deepspeed ds_config.json should enable multi-GPU training automatically. However, I am seeing that three separate processes are set up (akin to data parallelism), one per GPU, and training ends in an OOM error.
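For completeness, this is the kind of setup I'm running. The config values below are illustrative (a ZeRO-3 sketch based on my reading of the integration docs), not necessarily what my actual ds_config.json should contain:

```python
from transformers import TrainingArguments, Trainer

# Passing the config as a dict is equivalent to --deepspeed ds_config.json.
# ZeRO stage 3 shards parameters/gradients/optimizer states across the 3 ranks,
# so no single GPU has to hold the full 40B model. Values are illustrative.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="falcon40b-lora",      # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,              # or deepspeed="ds_config.json"
)

# My understanding from the docs is that with ZeRO-3 the TrainingArguments
# should be created before from_pretrained so the model can be sharded at
# load time -- I may well be getting that part wrong.
trainer = Trainer(
    model=model,                      # PEFT model from the snippet above
    args=training_args,
    train_dataset=train_dataset,      # my dataset (not shown)
)
trainer.train()
# Launched with: deepspeed --num_gpus=3 train.py ...
```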

On a different page, I found a reference to parallelformers, but I was unable to get that working either.
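This is roughly what I tried with parallelformers, following its README (and as far as I can tell it targets inference rather than training, so I may be using it for the wrong thing entirely):

```python
from transformers import AutoModelForCausalLM
from parallelformers import parallelize

# Split the model across the 3 GPUs with parallelformers; Falcon may not be
# among the architectures it supports, which could explain the failure.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b", trust_remote_code=True
)
parallelize(model, num_gpus=3, fp16=True, verbose="detail")
```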

Am I missing something? Is there a reference tutorial that demonstrates model parallelism for large language models?

I have also tried device_map="auto", which throws the "tensors on two different devices" error.
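That attempt looked roughly like this (a sketch; dtype and flags are what I believe I used):

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" spreads the layers across the 3 GPUs (naive pipeline-style
# model parallelism via accelerate), but during training I then hit
# "Expected all tensors to be on the same device, but found at least two devices".
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```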

@sgugger, hoping for your help here.

Hi, have you figured this out? I'm running into the same error.