CUDA Out-of-Memory Error with llama2-13b-chat Model on Multi-GPU Server

I'm hitting a CUDA out-of-memory error while running the llama2-13b-chat model on an AWS EC2 g4dn.12xlarge instance, which has 4 T4 GPUs with 16 GB of VRAM each.
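My rough expectation was that the weights alone should fit in that much total VRAM (back-of-the-envelope math below, assuming roughly 2 bytes per parameter for half-precision weights and ignoring the KV cache, activations, and per-process CUDA overhead):

# Back-of-the-envelope estimate (assumes ~2 bytes/parameter for fp16-style weights;
# KV cache, activations, and CUDA context overhead are not counted here).
params = 13_000_000_000            # llama2-13b parameter count
weights_gb = params * 2 / 1e9      # ~26 GB of weights
total_vram_gb = 4 * 16             # 4 x T4 with 16 GB each -> 64 GB total
print(f"weights ~ {weights_gb:.0f} GB, total VRAM = {total_vram_gb} GB")

Despite that, generation fails inside the attention computation with: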

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 1; 14.76 GiB total capacity; 13.68 GiB already allocated; 159.75 MiB free; 13.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
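If it helps, I read the max_split_size_mb hint at the end of the message as setting the allocator config before CUDA is initialized, along these lines (the 128 value below is just an example I picked, not something the message recommends), but with only ~160 MiB free on that GPU I doubt fragmentation is the whole story:

import os

# Allocator hint referenced by the error message; must be set before torch
# initializes CUDA. The 128 MiB split size is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported (and CUDA initialized) only after the env var is set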

Any suggestions on how to resolve this would be appreciated. Here is the code I am using:

import transformers
from langchain.llms import HuggingFacePipeline  # langchain_community.llms in newer LangChain versions

model_id = "meta-llama/Llama-2-13b-chat-hf"
hf_auth = ''  # Hugging Face access token goes here

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    device_map='auto',            # let accelerate shard the model across the 4 T4s
    use_auth_token=hf_auth,
    offload_folder="save_folder"  # offload target for layers that don't fit on GPU
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=4096,          # requested generation length
    repetition_penalty=1.1
)

llm = HuggingFacePipeline(pipeline=generate_text)
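
One direction I'm considering (sketch below, assuming bitsandbytes and accelerate are installed; the exact settings are my guesses, not a verified fix) is loading the checkpoint 4-bit quantized so the weights take far less VRAM per GPU. Is that the right approach here, or should I also be lowering max_new_tokens, since the KV cache for a 4096-token generation has to fit alongside the weights, or both?

import torch
import transformers

# Sketch only: load the same checkpoint 4-bit quantized (NF4) via bitsandbytes so the
# weight footprint drops to roughly a quarter of fp16. The settings are assumptions.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4s support fp16, not bf16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,                       # same model_id and hf_auth as above
    trust_remote_code=True,
    device_map='auto',
    quantization_config=bnb_config,
    use_auth_token=hf_auth,
)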