Model parallelism: how do I parallelize a transformer?

Hi there,
I am pretty new here, so I hope I'm doing this right :)

I have two NVIDIA GPUs, which work fine. I can train a model on each of them, and I can use data parallelism. I wonder if I can parallelize the model itself. Searching the internet, I found that it is possible, but no one explains how. Some frameworks support it, such as torchgpipe, DeepSpeed's PipelineModule, and FairScale, but they want sequential models, and transformers are hard to turn into sequential models.
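
To make the problem concrete, here is a rough sketch of what I mean (just an illustration, the wrapper name is mine): GPipe-style libraries expect an nn.Sequential, but BERT layers take an attention mask and return tuples, so they cannot be chained directly without adapters.

```python
# Rough sketch only: adapting BERT's encoder layers to nn.Sequential so a
# pipeline library (torchgpipe, DeepSpeed PipelineModule, FairScale) could
# partition them. The wrapper ignores the attention mask for simplicity.
import torch.nn as nn
from transformers import BertModel


class LayerWrapper(nn.Module):
    """Hypothetical adapter giving a BertLayer a single-tensor interface."""

    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states):
        # BertLayer returns a tuple; keep only the hidden states
        return self.layer(hidden_states)[0]


bert = BertModel.from_pretrained("bert-base-uncased")
# the embeddings and pooler would also need wrappers in a real pipeline
sequential_encoder = nn.Sequential(*(LayerWrapper(l) for l in bert.encoder.layer))
```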

Can you point me in the right direction?

Specs:
I want to parallelize a BERT model across two NVIDIA Titan Xp GPUs.

Thank you, any hints or help will be appreciated.

valgi0

hey @valgi0 my suggestion would be to try out the new accelerate library: GitHub - huggingface/accelerate: 🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision

in particular, there is an nlp example that shows you how to configure accelerate for the multi-GPU case here: accelerate/examples at main · huggingface/accelerate · GitHub
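
to give a rough idea of what that example does (this is just a sketch with dummy data, not the actual script), the core accelerate pattern looks like the following — note that it replicates the model on each GPU, i.e. data parallelism:

```python
# minimal sketch of the accelerate training loop with dummy data;
# accelerate wraps the model in DDP and shards the batches per process
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification

accelerator = Accelerator()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# random token ids and labels just to make the sketch self-contained
input_ids = torch.randint(0, 30522, (32, 16))
labels = torch.randint(0, 2, (32,))
train_dataloader = DataLoader(TensorDataset(input_ids, labels), batch_size=8)

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch_input_ids, batch_labels in train_dataloader:
    outputs = model(input_ids=batch_input_ids, labels=batch_labels)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```

you would run `accelerate config` once and then launch with `accelerate launch your_script.py` (script name is a placeholder).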

Thank you for the answer.
I am sorry to reply after so much time, but I was pretty busy.
However, I checked Accelerate and it only does data parallelism. Am I right?

I found out that some models, such as T5 and GPT-2, have a parallelize() method to split the encoder and decoder blocks across different devices. But that has serious limits; for example, you need a balanced encoder and decoder.
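
For reference, this is roughly how that API is used (the device map here is just an example I made up; T5 has an analogous method for its blocks):

```python
# rough example of parallelize() on GPT-2 (device map values are arbitrary):
# each key is a GPU index, each value the list of transformer blocks it holds
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # gpt2 has 12 blocks
device_map = {
    0: [0, 1, 2, 3, 4, 5],
    1: [6, 7, 8, 9, 10, 11],
}
model.parallelize(device_map)  # spreads the blocks over cuda:0 and cuda:1
# ... train or generate as usual; model.deparallelize() moves everything back
```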

I would like to do the same with BERT, so I tried to manually distribute the encoder layers across the two GPUs. It seems to work, but it is not optimized and it no longer works with the Trainer and other tools.
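
In case it helps to see what I mean, this is more or less what my manual split looks like (a simplified sketch: I drop the attention mask and the helper name is mine):

```python
# Manually splitting bert-base-uncased across two GPUs: embeddings + first
# six encoder layers on cuda:0, the last six layers + pooler on cuda:1.
# Activations are moved between devices at the split point.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
split = 6  # bert-base-uncased has 12 encoder layers

model.embeddings.to(dev0)
for layer in model.encoder.layer[:split]:
    layer.to(dev0)
for layer in model.encoder.layer[split:]:
    layer.to(dev1)
model.pooler.to(dev1)


def two_gpu_forward(input_ids):
    # embeddings and the first six layers run on GPU 0
    hidden_states = model.embeddings(input_ids=input_ids.to(dev0))
    for layer in model.encoder.layer[:split]:
        hidden_states = layer(hidden_states)[0]
    # move the activations across and finish on GPU 1
    hidden_states = hidden_states.to(dev1)
    for layer in model.encoder.layer[split:]:
        hidden_states = layer(hidden_states)[0]
    return hidden_states, model.pooler(hidden_states)


inputs = tokenizer("model parallelism test", return_tensors="pt")
with torch.no_grad():
    sequence_output, pooled_output = two_gpu_forward(inputs["input_ids"])
print(sequence_output.shape, pooled_output.shape)
```

It runs, but the two GPUs just wait for each other (no pipelining), and the Trainer knows nothing about the custom forward pass.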

I don't know; if you have any other ideas, please share them :slight_smile:
Thank you

ah i misunderstood your original question - from what i understand deepspeed supports model parallelism of the sort you describe: Feature Overview - DeepSpeed

there’s also a dedicated page for the deepspeed integration in transformers which might help: DeepSpeed Integration — transformers 4.7.0 documentation

i know stas was able to fine-tune T5 on a single gpu this way, so unless you have a very specific reason to want to parallelise BERT, this approach might be the best
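
as a rough starting point (the exact keys are documented on that integration page, and the "auto" values only work through the Trainer), a minimal ZeRO stage-2 config with CPU offload might look something like this:

```python
# hypothetical minimal ZeRO stage-2 config with optimizer offload to CPU;
# the "auto" values are filled in by the transformers Trainer integration.
# save it as ds_config.json and launch a Trainer-based script with e.g.
#   deepspeed your_script.py --deepspeed ds_config.json ...
import json

ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```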

hth!