This might be a simple question, but bugged me the whole afternoon.
I was trying to use a pretained m2m 12B model for language processing task (44G model file). I have 8 Tesla-V100 GPU cards, each of which has 32GB graphics memory. The program OOMed at:
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 30.49 GiB already allocated; 177.75 MiB free; 30.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I know the problem is one single GPU card’s memory is not big enough to load the whole model, but how can I leverage all my 8 cards memory to load the model and do predictions/generations? There must be someway to do this, otherwise if we have models that 's really huge, we eventually can’t have a GPU card with enough memory to load the model. I would really appreciate if someone can point me some directions or show me the path. Thanks in advance!
Thanks so much for the help!