Model Parallelism: how to parallelize a transformer?

Thank you for the answer.
Sorry for replying after so long, but I was pretty busy.
However, I checked Accelerate and it only performs data parallelism. Am I right?

I found out that some models such as T5 and GPT2 have a parallelize() method to split the encoder and decoder across different devices. But that has serious limits; for example, you need a balanced encoder and decoder.
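If I understand the API correctly, the usage looks roughly like this (a minimal sketch, assuming t5-small with its 6 encoder/decoder blocks and two visible GPUs; the device_map keys are GPU ids and the values are block indices):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# split the 6 blocks evenly across the two GPUs
device_map = {
    0: [0, 1, 2],  # blocks 0-2 on cuda:0
    1: [3, 4, 5],  # blocks 3-5 on cuda:1
}
model.parallelize(device_map)

tokenizer = T5Tokenizer.from_pretrained("t5-small")
inputs = tokenizer("translate English to German: Hello", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

model.deparallelize()  # moves everything back to CPU
```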

I would like to do the same with BERT, so I tried to manually distribute the encoder layers across two different GPUs (see the sketch below). It seems to work, but it lacks optimization and it no longer works with Trainer and other tools.
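This is roughly the kind of split I mean (a minimal sketch, not my exact code; it assumes bert-base-uncased, two GPUs, and PyTorch >= 2.0 for the with_kwargs forward pre-hook):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

first, second = torch.device("cuda:0"), torch.device("cuda:1")

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# embeddings + first 6 encoder layers on GPU 0
model.bert.embeddings.to(first)
for layer in model.bert.encoder.layer[:6]:
    layer.to(first)

# last 6 layers, pooler and classification head on GPU 1
for layer in model.bert.encoder.layer[6:]:
    layer.to(second)
model.bert.pooler.to(second)
model.classifier.to(second)

# move hidden states / attention masks to GPU 1 before each second-half layer runs
def to_second(module, args, kwargs):
    args = tuple(a.to(second) if torch.is_tensor(a) else a for a in args)
    kwargs = {k: v.to(second) if torch.is_tensor(v) else v for k, v in kwargs.items()}
    return args, kwargs

for layer in model.bert.encoder.layer[6:]:
    layer.register_forward_pre_hook(to_second, with_kwargs=True)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("model parallelism test", return_tensors="pt").to(first)
logits = model(**inputs).logits  # logits come back on cuda:1
```

The hooks keep the forward pass working, but the two GPUs just run one after the other with no pipelining, which is part of what I mean by "lacks optimization".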

I don't know; if you have any other ideas, please share them :slight_smile:
Thank you