How to load large model with multiple GPU cards?

This might be a simple question, but bugged me the whole afternoon.

I was trying to use a pretained m2m 12B model for language processing task (44G model file). I have 8 Tesla-V100 GPU cards, each of which has 32GB graphics memory. The program OOMed at:

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")

Error being:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 30.49 GiB already allocated; 177.75 MiB free; 30.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I know the problem is one single GPU card’s memory is not big enough to load the whole model, but how can I leverage all my 8 cards memory to load the model and do predictions/generations? There must be someway to do this, otherwise if we have models that 's really huge, we eventually can’t have a GPU card with enough memory to load the model. I would really appreciate if someone can point me some directions or show me the path. Thanks in advance!

Thanks so much for the help!


you can use model DP(Data Parallel) or DDP(Distributed Data parallel) to load huge model at Multi GPUs.


Thanks for the info, I was able to locate these techniques, but my experiment doesn’t show great improvement. Let me keep digging and see what happens.

Thank you!