Model Parallelism: how to parallelize a transformer?

Ah, I misunderstood your original question - from what I understand, DeepSpeed supports model parallelism of the sort you describe: Feature Overview - DeepSpeed

There's also a dedicated page for the DeepSpeed integration in Transformers which might help: DeepSpeed Integration — transformers 4.7.0 documentation

I know Stas was able to fine-tune T5 on a single GPU this way (rough sketch below), so unless you have a very specific reason to want to parallelise BERT, this approach might be the best one.
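To make that a bit more concrete, here's a minimal sketch of what the Trainer + DeepSpeed setup from that integration page looks like. To be clear, this isn't Stas's exact setup: the model name, the ZeRO stage 2 + CPU-offload settings, and the toy dataset are assumptions just to illustrate the idea, so check the linked docs for the real details.

```python
import json

import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# ZeRO stage 2 with optimizer-state offload to CPU -- roughly the kind of config
# that lets a model like T5 be fine-tuned on a single GPU. Values are
# illustrative, not tuned.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

model_name = "t5-small"  # assumed model; swap in whatever you're fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tiny dummy dataset just so the sketch runs end to end; replace with your data.
sources = ["translate English to German: Hello world"] * 8
targets = ["Hallo Welt"] * 8
enc = tokenizer(sources, truncation=True, padding=True)
label_ids = tokenizer(targets, truncation=True, padding=True)["input_ids"]

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    fp16=True,
    # This is where the Trainer picks DeepSpeed up: point it at the config file.
    deepspeed="ds_config.json",
)

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(enc, label_ids))
trainer.train()
```

The integration page linked above covers how to launch it (e.g. with the `deepspeed` launcher even on one GPU) and the full set of config options, so treat the above as a starting point only.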

hth!