TL;DR: Hi, I am trying to train a PEFT (LoRA/p-tuning) model on top of Falcon 40B using 3 A100s. I need model parallelism because even the bf16/fp16 model won't fit on a single GPU. Is there a way to do this?
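For reference, here is roughly what my setup looks like (a minimal sketch using the standard `peft` API; the LoRA hyperparameters below are illustrative placeholders, not my exact values):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # bf16 weights alone are ~80 GB, more than one A100
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                                # placeholder values
    lora_alpha=32,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```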
I have implemented a Trainer-based training script. According to the DeepSpeed integration documentation, launching the script with the `deepspeed` launcher and adding `--deepspeed ds_config.json` should handle multi-GPU training automatically. However, I am seeing three separate processes being set up, one per GPU (akin to data parallelism), and training ends in an OOM error.
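This is roughly how the Trainer and DeepSpeed are wired together (a sketch; the ZeRO config values here are placeholders, not my exact JSON file):

```python
from transformers import Trainer, TrainingArguments

# Placeholder DeepSpeed config; my actual file differs in details.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},          # shard params/grads/optimizer state
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="falcon40b-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,     # equivalent to passing --deepspeed ds_config.json
)

trainer = Trainer(
    model=model,                 # the PEFT-wrapped model from the sketch above
    args=training_args,
    train_dataset=train_dataset, # my tokenized dataset (omitted here)
)

# Launched with: deepspeed --num_gpus=3 train.py --deepspeed ds_config.json
trainer.train()
```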
On a different page, I found a reference to parallelformers, but I was not able to get that working either.
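If I am remembering the parallelformers API correctly, my attempt looked roughly like this (a sketch; the exact arguments may not match what I ran):

```python
from parallelformers import parallelize
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    trust_remote_code=True,
)

# Split the model across the 3 GPUs (arguments here are my best recollection).
parallelize(model, num_gpus=3, fp16=True, verbose="detail")
```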
Am I missing something? Is there a reference tutorial that demonstrates model parallelism for training large language models?
I have also tried `device_map="auto"`, which throws the "tensors on two different devices" error during training.
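That attempt was essentially this (sketch only):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    torch_dtype=torch.bfloat16,
    device_map="auto",        # accelerate splits the layers across the 3 A100s
    trust_remote_code=True,
)

# Wrapping this with LoRA and passing it to Trainer is where I hit:
# RuntimeError: Expected all tensors to be on the same device, but found at least two devices, ...
```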
@sgugger Hoping for your help here.