How to load a large model with multiple GPU cards?

This might be a simple question, but it bugged me the whole afternoon.

I was trying to use a pretrained M2M100 12B model for a language processing task (44GB model file). I have 8 Tesla V100 GPU cards, each of which has 32GB of graphics memory. The program OOMed at:

from transformers import M2M100ForConditionalGeneration

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")

Error being:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 30.49 GiB already allocated; 177.75 MiB free; 30.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I know the problem is that one single GPU card's memory is not big enough to load the whole model, but how can I leverage all 8 cards' memory to load the model and do predictions/generations? There must be some way to do this; otherwise, once models get really huge, we eventually can't have a single GPU card with enough memory to load the model. I would really appreciate it if someone could point me in the right direction or show me the path. Thanks in advance!

Thanks so much for the help!


Hi,

You can use DP (DataParallel) or DDP (DistributedDataParallel) to load a huge model on multiple GPUs.

https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
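
For illustration, here is a minimal DataParallel sketch with a small toy model (not the 12B checkpoint). Note that DP replicates the full model on every GPU and only splits the input batch, so each card still has to fit the whole model:

import torch
import torch.nn as nn

# Toy model small enough to replicate on every GPU
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = nn.DataParallel(model)   # replicate across all visible GPUs
model = model.to("cuda:0")       # parameters live on the primary device

x = torch.randn(64, 1024, device="cuda:0")
out = model(x)                   # the batch of 64 is split across the GPUs
print(out.shape)                 # torch.Size([64, 10])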

regards.


Thanks for the info. I was able to locate these techniques, but my experiments don't show great improvement. Let me keep digging and see what happens.

Thank you!


Hi @jasonme,
Did you manage to solve the issue? My understanding is that data parallelism (the links posted by @cog) is not useful in your case, because what you're trying to do is model parallelism, i.e. splitting the same model across multiple GPUs. Data parallelism distributes the data across multiple GPUs to speed up training, but each GPU still needs to be big enough to hold the whole model, which is not the case for you.
I have the same issue, so please let me know if you manage to find a solution.


Hello @jasonme, do you have any update on your issue? I have a similar case to deal with!

I found a solution using PyTorch model parallelism: Single-Machine Model Parallel Best Practices — PyTorch Tutorials 1.13.1+cu117 documentation
It lets you split the model into submodules spread across your GPUs.
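
The basic idea from that tutorial, shown here as a minimal sketch with a hypothetical toy model (not the actual 12B checkpoint), is to place different submodules on different GPUs and move the activations between them in forward():

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # first half of the model lives on GPU 0, second half on GPU 1
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # hand the intermediate activations over to the second GPU
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 10]), tensor lives on cuda:1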


I think you can follow this: Handling big models for inference
Load the model with the device_map="auto" parameter.
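
For the original M2M100 case that would look roughly like the sketch below. It assumes accelerate is installed; half precision is an assumption on my part to roughly halve the memory footprint across the 8 cards:

import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")
model = M2M100ForConditionalGeneration.from_pretrained(
    "facebook/m2m100-12B-avg-5-ckpt",
    device_map="auto",          # shard the layers across all visible GPUs
    torch_dtype=torch.float16,  # assumption: fp16 to reduce memory use
)

tokenizer.src_lang = "en"
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda:0")  # inputs go to the first GPU
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))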


I followed the Accelerate doc: Handling big models for inference

Below is a fully working example that lets me load Code Llama onto multiple GPUs.

import time

from transformers import pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the model; device_map="auto" (requires accelerate)
# spreads the model's layers across all available GPUs.
t1 = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-34b-Instruct-hf", device_map="auto")

t2 = time.perf_counter()
print(f"Loading tokenizer and model: took {t2-t1} seconds to execute.")

# Create a text-generation pipeline around the dispatched model
code_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

t3 = time.perf_counter()
print(f"Creating pipeline: took {t3-t2} seconds to execute.")

# Generate code for each input string in an interactive loop
while True:
  print("\n=========Please type in your question=========================\n")
  user_content = input("\nQuestion: ").strip()  # user question
  t1 = time.perf_counter()
  generated_code = code_generator(user_content, max_length=256)[0]['generated_text']
  t2 = time.perf_counter()
  print(f"Inferencing using the model: took {t2-t1} seconds to execute.")
  print(generated_code)

Hi,

  • Firstly, I would like to know whether this loading method can also be applied to a large language model with 13B params stored in 3 shards (i.e. 3 *.bin pre-trained model files, around 26GB in total),
    or whether I need to merge all the shards into a single file before loading it onto multiple GPUs?

  • Secondly, if having the pre-trained model in multiple shards does not matter, is it possible to load a pre-trained 13B LLM (26GB) together with an embedding model (3GB) on 2 RTX 4090 GPUs (24GB VRAM each)?

In fact, I intend to build a desktop with 2 GeForce RTX 4090s installed in the motherboard, and the exact LLM I use is Baichuan2-13B-Chat (baichuan-inc/Baichuan2-13B-Chat · Hugging Face). Thus, I need to check this carefully before buying.
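
For reference, this is roughly what I plan to try (just a sketch based on the Accelerate example above; the max_memory caps, fp16, and whether the sharded *.bin files load directly are my assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan2-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                     # spread layers over both 4090s
    max_memory={0: "22GiB", 1: "22GiB"},   # assumption: leave headroom for the embedding model and activations
    torch_dtype=torch.float16,             # assumption: half precision, ~26GB of weights total
    trust_remote_code=True,                # Baichuan2 ships custom modeling code
)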

Thank you so much!