Running into CUDA out of memory when running the Llama 2 13B Chat model on a multi-GPU machine

I’m trying to run the Llama 2 13B model with RoPE scaling on an AWS g4dn.12xlarge machine, which has 4 GPUs with 16 GB of VRAM each, but I’m getting a CUDA out of memory error.

Code:



from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    rope_scaling={"factor": 2.0, "type": "linear"},
)

user_prompt = "..."

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)


sequences = pipeline(
    user_prompt,
    max_length=8000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

print(sequences)

This is the error I’m getting when the prompt size is greater than 4k tokens:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.51 GiB (GPU 0; 14.61 GiB total capacity; 11.92 GiB already allocated; 1.76 GiB free; 12.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is 64 GB not enough to run the model with an 8k context, or is there a bug in my code?

EDIT: I tried running it on a g5.12xlarge, which has 96 GB of VRAM (4 GPUs with 24 GB each), and I’m still running into CUDA out of memory.


Hey there!

A newbie here.

I was facing this very same issue. What I learned is that the model gets loaded onto just one of the GPU cards, so you need enough VRAM on that single GPU. For the 13B model that is around 26 GB, and on AWS the largest VRAM I could find was 24 GB, on g5 instances.

I solved it by loading the model with the 8-bit option, which requires less VRAM than the default 16-bit.
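
In code, that looks roughly like this (a minimal sketch, assuming the bitsandbytes package is installed; the model id is the one from the original post):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit quantization via bitsandbytes (pip install bitsandbytes accelerate)
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    quantization_config=quant_config,
)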

If you have more than one GPU available, you can also spread the load across multiple devices by specifying how much memory to use on each one, as in the sketch below. I didn’t try this, but it’s said to be slower.
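
Something like this (untested; the per-GPU memory limits are just placeholder values):

from transformers import AutoModelForCausalLM

# Cap how much memory may be used on each of the 4 GPUs; accelerate then splits the layers accordingly.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    max_memory={0: "14GiB", 1: "14GiB", 2: "14GiB", 3: "14GiB"},
)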

I hope this helps you.

EDIT: I edited this answer several times because I kept publishing it unfinished by mistake.


Hi, how about on Colab?

Hello,

Can you specify how to load the model with the 8-bit option?

Hi @sivaram002,

I’m not sure if you already fixed your problem.

However, I’ll just post one solution here using vLLM. Note that you need to install the vllm package under Linux: pip install vllm

The example code (set tensor_parallel_size=4 for your case):

from langchain.llms import VLLM

model = VLLM(model="meta-llama/Llama-2-13b-chat-hf", tensor_parallel_size=4, trust_remote_code=True,
             max_new_tokens=1024, temperature=0.6, top_k=5, top_p=0.9)
print(model("your query here"))

Original ref: vLLM | 🦜️🔗 Langchain

Hope this helps!

If using Hugging Face’s accelerate library is a viable option for you, enabling DeepSpeed is also a very easy way to decrease your GPU memory utilization without affecting downstream performance, and it should be usable regardless of the precision you’re using to load the model.
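
A rough, untested sketch of what ZeRO-3 inference through the transformers DeepSpeed integration could look like (it assumes the deepspeed package is installed, the script is launched with something like deepspeed --num_gpus 4 run.py, and the config values are placeholders):

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

# Minimal ZeRO stage-3 config; fp16 and a micro-batch size of 1 are placeholder choices.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) before from_pretrained so the weights are sharded while loading.
dschf = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", torch_dtype=torch.float16)

# Wrap the model in a DeepSpeed inference engine; ZeRO-3 partitions the weights across the GPUs.
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()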