Running into CUDA out of memory when running the Llama 2 13B Chat model on a multi-GPU machine

I’m trying to run the Llama 2 13B model with RoPE scaling on an AWS g4dn.12xlarge machine, which has 4 GPUs with 16 GB of VRAM each, but I’m getting a CUDA out of memory error.

Code:



from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    rope_scaling={"factor": 2.0, "type": "linear"},
)

user_prompt = "..."

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)


sequences = pipeline(
    user_prompt,
    max_length=8000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

print(sequences)

This is the error I’m getting when the prompt size is greater than 4k tokens:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.51 GiB (GPU 0; 14.61 GiB total capacity; 11.92 GiB already allocated; 1.76 GiB free; 12.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is 64 GB not enough to run the model with an 8k context, or is there a bug in my code?

EDIT: I tried running it on a g5.12xlarge, which has 96 GB of VRAM (4 GPUs with 24 GB each), and I’m still running into CUDA out of memory.


Hey there!

A newbie here.

I was facing this very same issue. What I learned is that the model gets loaded onto just one of the GPU cards, so you need enough VRAM on that single GPU. For the 13B model that is around 26 GB, and on AWS the largest VRAM I could find was 24 GB, on g5 instances.

I solved it by loading the model with the 8-bit option, which requires less VRAM than the default 16-bit.
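
In code, that looks roughly like this (a minimal sketch, assuming the bitsandbytes package is installed; the model id is the one from the original post):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit quantization via bitsandbytes (pip install bitsandbytes accelerate)
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    quantization_config=quant_config,
)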

If you have more than one GPU available, you can also spread the load across multiple devices by specifying how much memory to use on each one, as in the sketch below. I didn’t try this, but it’s said to be slower.
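
Something like this (untested; the per-GPU memory limits are just placeholder values):

from transformers import AutoModelForCausalLM

# Cap how much memory may be used on each of the 4 GPUs; accelerate then splits the layers accordingly.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    max_memory={0: "14GiB", 1: "14GiB", 2: "14GiB", 3: "14GiB"},
)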

I hope this helps you.

EDIT: I edited this answer several times because I kept publishing it unfinished by mistake.


Hi, how about on Colab?

Hello,

Can you specify how to load the model with the 8-bit option?

Hi @sivaram002,

I’m not sure if you already fixed your problem.

However, I’ll just post one solution here using vLLM. Note that you need to install the vllm package under Linux: pip install vllm

The example code (set tensor_parallel_size=4 for your case):

from langchain.llms import VLLM

model = VLLM(model="meta-llama/Llama-2-13b-chat-hf", tensor_parallel_size=4, trust_remote_code=True,
             max_new_tokens=1024, temperature=0.6, top_k=5, top_p=0.9)
print(model("your query here"))

Original ref: vLLM | 🦜️🔗 Langchain

Hope this helps!

If using Hugging Face’s accelerate library is a viable option for you, enabling DeepSpeed is also a very easy way to decrease your GPU memory utilization without affecting downstream performance, and it should be usable regardless of the precision you’re using to load the model.
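
A rough, untested sketch of what ZeRO-3 inference through the transformers DeepSpeed integration could look like (it assumes the deepspeed package is installed, the script is launched with something like deepspeed --num_gpus 4 run.py, and the config values are placeholders):

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

# Minimal ZeRO stage-3 config; fp16 and a micro-batch size of 1 are placeholder choices.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) before from_pretrained so the weights are sharded while loading.
dschf = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", torch_dtype=torch.float16)

# Wrap the model in a DeepSpeed inference engine; ZeRO-3 partitions the weights across the GPUs.
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()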