I am pretty new here, so I hope I'm doing this right. :)
I have two NVIDIA GPUs, and they work fine: I can train a model on each of them, and I can use data parallelism. I wonder whether I can parallelize the model itself. Searching the internet, I found that it is possible, but no one explains how. Some frameworks do it, such as torchgpipe, DeepSpeed's PipelineModule, and FairScale, but they expect sequential models, and transformers are hard to turn into sequential models.
Can you point me in the right direction?
I want to parallelize a BERT model on two Titan Xp GPUs.
Thank you, any hints or help will be appreciated.
Thank you for the answer.
I am sorry to reply after so much time, but I was pretty busy.
However, I checked Accelerate, and it performs only data parallelism. Am I right?
I found out that some models, such as T5 and GPT-2, have a parallelize() method to split the encoder and decoder across different devices. But that has serious limits; for example, you need a balanced encoder and decoder.
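For reference, here is a minimal sketch of the device_map that parallelize() takes on those models: a dict mapping GPU index to a list of block indices. The split below (6 + 6 blocks for a 12-block GPT-2) is illustrative; the actual parallelize() call is commented out since it needs two GPUs.

```python
# Sketch of a device_map for GPT2's parallelize() method: keys are GPU
# indices, values are the transformer block indices placed on that GPU.
# gpt2 (small) has 12 blocks; split them evenly across two devices.
device_map = {
    0: list(range(0, 6)),   # blocks 0-5 on cuda:0
    1: list(range(6, 12)),  # blocks 6-11 on cuda:1
}
# model.parallelize(device_map)  # requires 2 GPUs; shown for illustration

# Sanity check: every block is assigned exactly once.
total_blocks = sum(len(blocks) for blocks in device_map.values())
print(total_blocks)  # 12
```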
I would like to do the same with BERT, so I tried to manually distribute the encoder layers across the two GPUs. It seems to work, but it lacks optimization, and it no longer works with Trainer and other tools.
I don't know; if you have any other ideas, please come forward.
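To make the manual approach concrete, here is a minimal sketch of splitting an encoder stack across two devices, using a toy stack of nn.TransformerEncoderLayer modules rather than actual BERT (with BERT you would move model.bert.encoder.layer[i] around the same way). The class and layer sizes are illustrative; it falls back to CPU when two GPUs are not available.

```python
import torch
import torch.nn as nn

# Pick two devices; fall back to CPU so the sketch runs anywhere.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

class SplitEncoder(nn.Module):
    """Toy encoder stack with the first half of its layers on dev0
    and the second half on dev1 (naive model parallelism)."""

    def __init__(self, hidden=32, n_layers=4):
        super().__init__()
        layers = [
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
            for _ in range(n_layers)
        ]
        self.first = nn.ModuleList(layers[: n_layers // 2]).to(dev0)
        self.second = nn.ModuleList(layers[n_layers // 2 :]).to(dev1)

    def forward(self, x):
        x = x.to(dev0)
        for layer in self.first:
            x = layer(x)
        # Move activations between devices at the split point.
        x = x.to(dev1)
        for layer in self.second:
            x = layer(x)
        return x

model = SplitEncoder()
out = model(torch.randn(2, 8, 32))  # (batch, seq_len, hidden)
print(out.shape)  # torch.Size([2, 8, 32])
```

Note the drawback you describe: only one GPU is busy at a time (no pipelining), and tools that assume a single device, such as Trainer, break on the cross-device .to() calls.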
Ah, I misunderstood your original question. From what I understand, DeepSpeed supports model parallelism of the sort you describe: Feature Overview - DeepSpeed
There's also a dedicated page for the DeepSpeed integration in transformers which might help: DeepSpeed Integration — transformers 4.7.0 documentation
I know Stas was able to fine-tune T5 on a single GPU this way, so unless you have a very specific reason to want to parallelize BERT, this approach might be the best.
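As a starting point, here is a sketch of the kind of minimal DeepSpeed ZeRO-2 config (with CPU optimizer offload, which is what makes single-GPU fine-tuning of large models feasible) that the transformers integration accepts via a ds_config.json file. The field names follow the DeepSpeed config schema; the values are illustrative, not tuned.

```python
import json

# Minimal DeepSpeed config sketch: ZeRO stage 2 with the optimizer state
# offloaded to CPU. Pass the resulting file to Trainer via
# --deepspeed ds_config.json (values here are illustrative).
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

with open("ds_config.json") as f:
    loaded = json.load(f)
print(loaded["zero_optimization"]["stage"])  # 2
```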