This might be a simple question, but it has bugged me the whole afternoon.
I was trying to use a pretrained M2M100 12B model for a language processing task (44 GB model file). I have 8 Tesla V100 GPUs, each with 32 GB of memory. The program OOMed at:
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")
The error was:
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 30.49 GiB already allocated; 177.75 MiB free; 30.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I know the problem is that a single GPU's memory is not big enough to load the whole model, but how can I leverage the memory of all 8 cards to load the model and run predictions/generations? There must be some way to do this; otherwise, once models get really huge, we would eventually have no single GPU with enough memory to load them. I would really appreciate it if someone could point me in the right direction. Thanks in advance!
Thanks for the info. I was able to locate these techniques, but my experiments don't show much improvement. Let me keep digging and see what happens.
Hi @jasonme ,
Did you manage to solve the issue? My understanding is that data parallelism (the links posted by @cog) is not useful in your case, because what you're trying to do is model parallelism, i.e. splitting the same model across multiple GPUs. Data parallelism distributes the data across multiple GPUs to speed up training, but each GPU still needs to be big enough to hold the whole model, which is not the case for you.
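For what it's worth, here is a minimal, untested sketch of the model-parallel approach using transformers' big-model inference (it assumes accelerate is installed; I haven't tried this exact checkpoint myself):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")
# device_map="auto" lets accelerate shard the weights across all visible GPUs
# instead of trying to fit the whole 12B model onto GPU 0.
model = M2M100ForConditionalGeneration.from_pretrained(
    "facebook/m2m100-12B-avg-5-ckpt",
    device_map="auto",
    torch_dtype="auto",  # keep the checkpoint's dtype rather than upcasting to fp32
)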
I have the same issue; please let me know if you managed to find a solution.
Below is a fully working example (for me) that loads Code Llama across multiple GPUs.
import time

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

t1 = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")
# device_map="auto" (requires accelerate) shards the model across all available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-Instruct-hf", device_map="auto"
)
t2 = time.perf_counter()
print(f"Loading tokenizer and model: took {t2 - t1} seconds to execute.")

# Create a pipeline
code_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
t3 = time.perf_counter()
print(f"Creating pipeline: took {t3 - t2} seconds to execute.")

# Generate code for an input string
while True:
    print("\n=========Please type in your question=========================\n")
    user_content = input("\nQuestion: ").strip()  # User question
    t1 = time.perf_counter()
    generated_code = code_generator(user_content, max_length=256)[0]["generated_text"]
    t2 = time.perf_counter()
    print(f"Inferencing using the model: took {t2 - t1} seconds to execute.")
    print(generated_code)
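If you want to see how the layers ended up distributed across your GPUs, you can inspect the device map that transformers records when device_map is used:

# Shows which device each module was placed on, e.g. {'model.embed_tokens': 0, ..., 'lm_head': 1}
print(model.hf_device_map)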
Firstly, I would like to know whether this loading method can also be applied to a large language model with 13B parameters stored in 3 shards (i.e. three *.bin pre-trained model files, around 26GB in total)?
Or do I need to merge all shards into a single file before loading them onto multiple GPUs?
Secondly, if having multiple shards of the pre-trained model doesn't matter, is it possible to load a pre-trained 13B LLM (26GB) and an embedding model (3GB) onto 2 RTX 4090 GPUs (each with 24GB VRAM)?
In fact, I intend to build a desktop with 2 GeForce RTX 4090s installed on the motherboard, and the exact LLM I will use is Baichuan2-13B-Chat (baichuan-inc/Baichuan2-13B-Chat · Hugging Face). Thus, I need to check carefully before buying.
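In case it helps, here is an untested sketch of how I would try it: from_pretrained handles sharded checkpoints (multiple *.bin files) directly, so no merging is needed, and max_memory can cap what the 13B model takes on each card so the embedding model still has room (the GiB values below are placeholders, not measured numbers):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sharded *.bin checkpoints are loaded transparently by from_pretrained.
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan2-13B-Chat",
    device_map="auto",
    trust_remote_code=True,               # Baichuan2 ships custom modeling code
    max_memory={0: "20GiB", 1: "20GiB"},  # leave headroom on each 24GB 4090 (placeholder values)
)
tokenizer = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan2-13B-Chat", trust_remote_code=True
)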