Model parallelism: how to parallelize a transformer?

Hi there,
I am pretty new here, so I hope I am doing this right :)

I have two NVIDIA GPUs, which work fine. I can train a model on each of them, and I can use data parallelism. I wonder whether I can parallelize the model itself. Searching online, I found that it is possible, but no one explains how. Some frameworks do it, such as torchgpipe, DeepSpeed's PipelineModule, and FairScale, but they want sequential models, and transformers are hard to turn into sequential models.

Can you point me in the right direction?

I want to parallelize a BERT model across two Titan Xp GPUs.

Thank you, any hints or help will be appreciated.


hey @valgi0 my suggestion would be to try out the new accelerate library: GitHub - huggingface/accelerate: 🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision

in particular, there is an nlp example that shows you how to configure accelerate for the multi-GPU case here: accelerate/examples at main · huggingface/accelerate · GitHub

Thank you for the answer.
I am sorry to reply after so much time, but I was pretty busy.
However, I checked accelerate and it seems to perform only data parallelism. Am I right?

I found out that some models, such as T5 and GPT2, have a parallelize() method to split the encoder and decoder across different devices. But that has serious limits: for example, you need a balanced encoder and decoder.
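
For reference, parallelize() takes a device_map dict that maps each GPU index to the list of layer (block) indices it should hold. Below is a minimal sketch of building a balanced map for two GPUs. Note that balanced_device_map is a hypothetical helper written for illustration, not part of transformers, and the commented-out call assumes a transformers version that still ships parallelize().

```python
def balanced_device_map(num_layers, num_devices):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal chunks.

    Hypothetical helper for illustration; not part of transformers.
    """
    base, extra = divmod(num_layers, num_devices)
    device_map = {}
    start = 0
    for device in range(num_devices):
        # The first `extra` devices each take one additional layer.
        size = base + (1 if device < extra else 0)
        device_map[device] = list(range(start, start + size))
        start += size
    return device_map


# The "gpt2" checkpoint has 12 transformer blocks; split them over 2 GPUs.
device_map = balanced_device_map(num_layers=12, num_devices=2)
print(device_map)

# Assumed usage, only on transformers versions that provide parallelize():
#   from transformers import GPT2LMHeadModel
#   model = GPT2LMHeadModel.from_pretrained("gpt2")
#   model.parallelize(device_map)
```

The split is contiguous so that activations cross from one GPU to the other only once per forward pass.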

I would like to do the same with BERT. I tried to manually distribute the encoder layers across the two GPUs. It seems to work, but it lacks optimization and no longer works with Trainer and other tools.

I don't know. If you have any other ideas, please come forward :)
Thank you

ah i misunderstood your original question - from what i understand deepspeed supports model parallelism of the sort you describe: Feature Overview - DeepSpeed

there’s also a dedicated page for the deepspeed integration in transformers which might help: DeepSpeed Integration — transformers 4.7.0 documentation

i know stas was able to fine-tune T5 on a single gpu this way, so unless you have a very specific reason to want to parallelise BERT, this approach might be the best
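
for reference, here is a minimal sketch of the kind of ZeRO config the integration page describes, written as a Python dict and dumped to JSON. the specific choices (stage 2, optimizer offload to CPU, "auto" values that the transformers integration fills in from the Trainer arguments) are assumptions for illustration, not a tuned setup, and the key names can differ between DeepSpeed releases (older ones used "cpu_offload" instead of "offload_optimizer"), so check them against the docs for your installed versions.

```python
import json

# Assumed starting point, not a tuned configuration.
ds_config = {
    # "auto" lets the transformers integration copy the value from the
    # Trainer arguments (assumes a version that supports "auto").
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        # Stage 2 shards optimizer states and gradients across the GPUs.
        "stage": 2,
        # Keep optimizer states in CPU memory to free up GPU memory.
        "offload_optimizer": {"device": "cpu"},
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

you would save the printed JSON to a file, e.g. ds_config.json, and point the Trainer at it with TrainingArguments(deepspeed="ds_config.json") or the --deepspeed flag of the example scripts.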