CUDA Out-of-Memory Error with llama2-13b-chat Model on Multi-GPU Server

I'm hitting a CUDA out-of-memory error while running the llama2-13b-chat model on an AWS EC2 g4dn.12xlarge instance, which has 4 T4 GPUs with 16 GB of VRAM each.
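My rough expectation was that the weights alone should fit in that much total VRAM (back-of-the-envelope math below, assuming roughly 2 bytes per parameter for half-precision weights and ignoring the KV cache, activations, and per-process CUDA overhead):

# Back-of-the-envelope estimate (assumes ~2 bytes/parameter for fp16-style weights;
# KV cache, activations, and CUDA context overhead are not counted here).
params = 13_000_000_000            # llama2-13b parameter count
weights_gb = params * 2 / 1e9      # ~26 GB of weights
total_vram_gb = 4 * 16             # 4 x T4 with 16 GB each -> 64 GB total
print(f"weights ~ {weights_gb:.0f} GB, total VRAM = {total_vram_gb} GB")

Despite that, generation fails inside the attention computation with: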

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 1; 14.76 GiB total capacity; 13.68 GiB already allocated; 159.75 MiB free; 13.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
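If it helps, I read the max_split_size_mb hint at the end of the message as setting the allocator config before CUDA is initialized, along these lines (the 128 value below is just an example I picked, not something the message recommends), but with only ~160 MiB free on that GPU I doubt fragmentation is the whole story:

import os

# Allocator hint referenced by the error message; must be set before torch
# initializes CUDA. The 128 MiB split size is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported (and CUDA initialized) only after the env var is set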

Any suggestions on how to resolve this would be appreciated. Here is the code I am using:

import transformers
from langchain.llms import HuggingFacePipeline  # langchain_community.llms in newer LangChain versions

model_id = "meta-llama/Llama-2-13b-chat-hf"
hf_auth = ''  # Hugging Face access token goes here

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    device_map='auto',            # let accelerate shard the model across the 4 T4s
    use_auth_token=hf_auth,
    offload_folder="save_folder"  # offload target for layers that don't fit on GPU
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=4096,          # requested generation length
    repetition_penalty=1.1
)

llm = HuggingFacePipeline(pipeline=generate_text)
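
One direction I'm considering (sketch below, assuming bitsandbytes and accelerate are installed; the exact settings are my guesses, not a verified fix) is loading the checkpoint 4-bit quantized so the weights take far less VRAM per GPU. Is that the right approach here, or should I also be lowering max_new_tokens, since the KV cache for a 4096-token generation has to fit alongside the weights, or both?

import torch
import transformers

# Sketch only: load the same checkpoint 4-bit quantized (NF4) via bitsandbytes so the
# weight footprint drops to roughly a quarter of fp16. The settings are assumptions.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4s support fp16, not bf16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,                       # same model_id and hf_auth as above
    trust_remote_code=True,
    device_map='auto',
    quantization_config=bnb_config,
    use_auth_token=hf_auth,
)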