Colab CUDA OOM using Llama-2-7b-chat-hf even with 40 GB GPU RAM

I am running some basic text generation using Llama-2-7b-chat-hf. I started with 15 GB of GPU RAM in Colab and then switched to an A100 with 40 GB of GPU RAM. This should be plenty of memory.
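
For reference, this is roughly how I check the memory on the runtime (standard torch.cuda call; the exact free/total figures vary per session):

import torch

# Free and total device memory in bytes for the current GPU
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free:  {free_bytes / 1024**3:.2f} GiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")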

I load the model per below:

import torch
import transformers

# `model` and `tokenizer` are the Llama-2-7b-chat-hf model and tokenizer loaded earlier
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=3000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
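
The generation call itself is nothing special; it is roughly the following, where `prompt` is a placeholder for my ~10k-token input:

# `prompt` stands in for the ~10k-token input string
sequences = pipeline(prompt)
for seq in sequences:
    print(seq["generated_text"])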

When I run on the 15 GB GPU with a ~10k-token input, I get:
“CUDA out of memory. Tried to allocate 6.71 GiB. GPU 0 has a total capacty of 15.77 GiB of which 1.76 GiB is free. Process 22833 has 14.01 GiB memory in use. Of the allocated memory 13.24 GiB is allocated by PyTorch…”.

So I switched to the A100; however, when I run the exact same model with the exact same input, I get:
“OutOfMemoryError: CUDA out of memory. Tried to allocate 13.42 GiB. GPU 0 has a total capacty of 39.56 GiB of which 5.29 GiB is free. Process 50573 has 34.27 GiB memory in use. Of the allocated memory 33.37 GiB is allocated by PyTorch…”

So it is roughly doubling the amount it is trying to allocate, while PyTorch has about 2.5x the memory allocated.
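
In case it helps with diagnosis, this is roughly what I run just before the failing call to see where the memory is going (plain torch.cuda counters, nothing model-specific):

import torch

# Memory held by tensors vs. memory reserved by PyTorch's caching allocator
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))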

Clearly I have something wrong. Does anyone have any thoughts on how this could occur?

Thanks