This might be a simple question, but it has bugged me the whole afternoon.
I was trying to use a pretrained M2M100 12B model for a language processing task (44 GB model file). I have 8 Tesla V100 GPUs, each with 32 GB of memory. The program OOMed at:
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")
The error was:
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 30.49 GiB already allocated; 177.75 MiB free; 30.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I know the problem is that a single GPU's memory is not big enough to load the whole model, but how can I leverage the memory of all 8 cards to load the model and run predictions/generations? There must be some way to do this; otherwise, once models get really huge, we would eventually have no single GPU with enough memory to load them. I would really appreciate it if someone could point me in the right direction. Thanks in advance!
Thanks for the info. I was able to locate these techniques, but my experiments don't show much improvement. Let me keep digging and see what happens.
Hi @jasonme ,
Did you manage to solve the issue? My understanding is that data parallelism (the links posted by @cog) is not useful in your case, because what you're trying to do is model parallelism, i.e. splitting the same model across multiple GPUs. Data parallelism distributes the data across multiple GPUs to speed up training, but each GPU still needs to be big enough to hold the whole model, which is not the case for you.
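For what it's worth, here is a minimal, untested sketch of the model-parallel approach using transformers' big-model inference (it assumes accelerate is installed; I haven't tried this exact checkpoint myself):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")
# device_map="auto" lets accelerate shard the weights across all visible GPUs
# instead of trying to fit the whole 12B model onto GPU 0.
model = M2M100ForConditionalGeneration.from_pretrained(
    "facebook/m2m100-12B-avg-5-ckpt",
    device_map="auto",
    torch_dtype="auto",  # keep the checkpoint's dtype rather than upcasting to fp32
)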
I have the same issue; please let me know if you managed to find a solution.
Below is a fully working example (for me) that loads Code Llama across multiple GPUs.
import time

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

t1 = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")
# device_map="auto" (requires accelerate) shards the model across all available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-Instruct-hf", device_map="auto"
)
t2 = time.perf_counter()
print(f"Loading tokenizer and model: took {t2 - t1} seconds to execute.")

# Create a pipeline
code_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
t3 = time.perf_counter()
print(f"Creating pipeline: took {t3 - t2} seconds to execute.")

# Generate code for an input string
while True:
    print("\n=========Please type in your question=========================\n")
    user_content = input("\nQuestion: ").strip()  # User question
    t1 = time.perf_counter()
    generated_code = code_generator(user_content, max_length=256)[0]["generated_text"]
    t2 = time.perf_counter()
    print(f"Inferencing using the model: took {t2 - t1} seconds to execute.")
    print(generated_code)
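If you want to see how the layers ended up distributed across your GPUs, you can inspect the device map that transformers records when device_map is used:

# Shows which device each module was placed on, e.g. {'model.embed_tokens': 0, ..., 'lm_head': 1}
print(model.hf_device_map)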
Firstly, I would like to know whether this loading method can also be applied to a large language model with 13B parameters stored in 3 shards (i.e. three *.bin pre-trained model files, around 26GB in total)?
Or do I need to merge all shards into a single file before loading them onto multiple GPUs?
Secondly, if having multiple shards of the pre-trained model doesn't matter, is it possible to load a pre-trained 13B LLM (26GB) and an embedding model (3GB) onto 2 RTX 4090 GPUs (each with 24GB VRAM)?
In fact, I intend to build a desktop with 2 GeForce RTX 4090s installed on the motherboard, and the exact LLM I will use is Baichuan2-13B-Chat (baichuan-inc/Baichuan2-13B-Chat · Hugging Face). Thus, I need to check carefully before buying.
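In case it helps, here is an untested sketch of how I would try it: from_pretrained handles sharded checkpoints (multiple *.bin files) directly, so no merging is needed, and max_memory can cap what the 13B model takes on each card so the embedding model still has room (the GiB values below are placeholders, not measured numbers):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sharded *.bin checkpoints are loaded transparently by from_pretrained.
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan2-13B-Chat",
    device_map="auto",
    trust_remote_code=True,               # Baichuan2 ships custom modeling code
    max_memory={0: "20GiB", 1: "20GiB"},  # leave headroom on each 24GB 4090 (placeholder values)
)
tokenizer = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan2-13B-Chat", trust_remote_code=True
)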