I’m trying to run the Llama 2 13B model with RoPE scaling on an AWS g4dn.12xlarge instance, which has 4 GPUs with 16 GB VRAM each, but I’m getting a CUDA out-of-memory error.
Code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

# Load the fp16 checkpoint, sharded across the 4 GPUs, with linear RoPE scaling
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    rope_scaling={"factor": 2.0, "type": "linear"},
)
user_prompt = "..."
pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)
sequences = pipe(
    user_prompt,
    max_length=8000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
print(sequences)
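One thing I’m unsure about: as far as I know, from_pretrained loads weights in fp32 unless torch_dtype is given, even for an fp16 checkpoint. A sketch of loading explicitly in half precision would be (untested on my side; torch_dtype and torch.float16 are the standard transformers/PyTorch names):

import torch
from transformers import AutoModelForCausalLM

# Sketch only: keep the checkpoint weights in fp16 instead of the fp32 default
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    torch_dtype=torch.float16,
    rope_scaling={"factor": 2.0, "type": "linear"},
)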
This is the error I’m getting when the prompt is longer than 4k tokens:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.51 GiB (GPU 0; 14.61 GiB total capacity; 11.92 GiB already allocated; 1.76 GiB free; 12.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
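(Side note: I understand the max_split_size_mb hint in the error is set through the PYTORCH_CUDA_ALLOC_CONF environment variable, roughly as below. As far as I know it only addresses fragmentation rather than total memory, and I haven’t confirmed it helps here; the value 128 is just an example.)

import os

# Must be set before the first CUDA allocation; 128 is only an example value
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"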
Is 64 GB of VRAM not enough to run the model with an 8k context, or is there a bug in my code?
EDIT: I tried running it on a g5.12xlarge, which has 96 GB of VRAM (4 GPUs with 24 GB each), and I’m still running into CUDA out of memory.
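For what it’s worth, here is my rough back-of-the-envelope estimate of where the memory should go, assuming the standard Llama 2 13B config (40 layers, 40 heads, head dim 128; those numbers are my assumption) and ignoring activations and attention buffers:

# Rough memory estimate for Llama 2 13B (assumed: 40 layers, 40 heads, head_dim 128)
params = 13e9
weights_fp16_gb = params * 2 / 1e9                 # ~26 GB if the weights stay in fp16
weights_fp32_gb = params * 4 / 1e9                 # ~52 GB if they end up in fp32

layers, heads, head_dim, seq_len = 40, 40, 128, 8000
# K and V caches: 2 tensors x layers x heads x head_dim x seq_len x 2 bytes (fp16)
kv_cache_gb = 2 * layers * heads * head_dim * seq_len * 2 / 1e9   # ~6.6 GB

print(weights_fp16_gb, weights_fp32_gb, kv_cache_gb)

By that rough count, fp16 weights plus an 8k KV cache come to about 33 GB, which is why I expected 64 GB (let alone 96 GB) to be enough.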