google/gemma-2-2b-it crashes in Google Colab

I've been trying to use the "google/gemma-2-2b-it" model in Google Colab for question answering. I have been granted access to the gated model.

However, loading it crashes with a CUDA out-of-memory error, even though a GPU is allocated to the runtime.

```python
from transformers import pipeline

# device=0 places the model on the first GPU
gengemma = pipeline("text-generation", model="google/gemma-2-2b-it", device=0)

gengemma(
    "The chances of Rafael Nadal winning another grand slam in 2025 is",
    max_length=30,  # counts the prompt tokens as well as the generated ones
    num_return_sequences=1,
    truncation=True,
)
```
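For a sense of scale, here is a rough weight-only estimate of how much GPU memory the model needs at different precisions. The call above does not pass a dtype; the sketch assumes the weights would then be loaded in float32, and it assumes a parameter count of roughly 2.6 billion (check the model card). `weight_gib` and `BYTES_PER_PARAM` are hypothetical names for illustration only, and activations, the KV cache and CUDA context overhead are not included.

```python
# Rough weight-only memory estimate for a model at different dtypes.
# ASSUMPTION: gemma-2-2b-it has roughly 2.6 billion parameters.
# Activations, the KV cache and CUDA context overhead are NOT included.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in GiB (1 GiB = 2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    n = 2.6e9  # assumed parameter count
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_gib(n, dtype):.1f} GiB")
```

With the assumed 2.6e9 parameters this gives about 9.7 GiB in float32 and about 4.8 GiB in bfloat16 or float16, against the 14.75 GiB GPU reported in the error. `pipeline()` accepts a `torch_dtype` argument (for example `torch_dtype=torch.bfloat16`) to request half-precision loading. The error also reports 14.57 GiB already allocated by PyTorch in the same process, which is more than this weight-only estimate, so it may be worth checking whether anything loaded earlier in the same runtime is still holding GPU memory.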

The full error:

```
OutOfMemoryError Traceback (most recent call last)
in <cell line: 4>()
2
3 # gengemma=pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
----> 4 gengemma=pipeline("text-generation", model="google/gemma-2-2b-it", device=0) # device=0 for GPU access
5 gengemma("The chances of Rafael Nadal winning another grand slam in 2025 is",
6 max_length=30,

11 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in convert(t)
1158 memory_format=convert_to_format,
1159 )
-> 1160 return t.to(
1161 device,
1162 dtype if t.is_floating_point() or t.is_complex() else None,

OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 9.06 MiB is free. Process 44930 has 14.74 GiB memory in use. Of the allocated memory 14.57 GiB is allocated by PyTorch, and 70.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.4 documentation)
```
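The error message suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal sketch of doing that from Python follows; it assumes a freshly restarted runtime, because the variable has to be in place before PyTorch first initializes CUDA, and setting it before `import torch` is the safe choice. Note that the message reports only 70.78 MiB as reserved but unallocated, versus 14.57 GiB actually allocated, so fragmentation is probably not the main cause here and this setting alone may not be enough.

```python
import os

# Allocator option quoted in the OutOfMemoryError message. Set it before
# PyTorch touches the GPU, e.g. in the first cell of a restarted runtime.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # import PyTorch (and transformers) only after this point
```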