Colab CUDA OOM using Llama-2-7b-chat-hf even with 40 GB GPU RAM

I am running some basic text generation using Llama-2-7b-chat-hf. I started with 15 GB of GPU RAM in Colab and then switched to an A100 with 40 GB of GPU RAM. This should be plenty of memory.
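
For reference, this is roughly how I check the memory on the runtime (standard torch.cuda call; the exact free/total figures vary per session):

import torch

# Free and total device memory in bytes for the current GPU
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free:  {free_bytes / 1024**3:.2f} GiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")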

I load the model per below:

import torch
import transformers

# `model` and `tokenizer` are the Llama-2-7b-chat-hf model and tokenizer loaded earlier
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=3000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
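
The generation call itself is nothing special; it is roughly the following, where `prompt` is a placeholder for my ~10k-token input:

# `prompt` stands in for the ~10k-token input string
sequences = pipeline(prompt)
for seq in sequences:
    print(seq["generated_text"])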

When I run on the 15 GB GPU with a ~10k-token input, I get:
“CUDA out of memory. Tried to allocate 6.71 GiB. GPU 0 has a total capacty of 15.77 GiB of which 1.76 GiB is free. Process 22833 has 14.01 GiB memory in use. Of the allocated memory 13.24 GiB is allocated by PyTorch…”.

So I switched to the A100; however, when I run the exact same model with the exact same input, I get:
“OutOfMemoryError: CUDA out of memory. Tried to allocate 13.42 GiB. GPU 0 has a total capacty of 39.56 GiB of which 5.29 GiB is free. Process 50573 has 34.27 GiB memory in use. Of the allocated memory 33.37 GiB is allocated by PyTorch…”

So it is roughly doubling the amount it is trying to allocate, while PyTorch has about 2.5x the memory allocated.
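
In case it helps with diagnosis, this is roughly what I run just before the failing call to see where the memory is going (plain torch.cuda counters, nothing model-specific):

import torch

# Memory held by tensors vs. memory reserved by PyTorch's caching allocator
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))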

Clearly I have something wrong. Does anyone have any thoughts on how this could occur?

Thanks