Pipeline VRAM problem with Google Colab

Hello

I am using the Llama 3 8B model in Google Colab.

I am trying to run it in an environment with 40 GB of VRAM.
I can load the model with quantization and LoRA, but the moment I load the pipeline, the VRAM is exceeded and it becomes impossible to run.
To solve this problem, should I write the generation code myself instead of loading the pipeline?
If you have any code examples, I would really appreciate it if you could post them. (This is a text-generation pipeline.)
If you have any other methods, please let me know.


Normally, the majority of VRAM is consumed when the model is loaded, but some is also consumed during inference. Also, although the reason is not well understood, there are cases where an abnormal amount of VRAM is consumed during pipeline execution. There are several code examples for running without the pipeline; below is a sketch along the lines of the official HF usage example for Llama.
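
A minimal sketch, assuming the meta-llama/Meta-Llama-3-8B-Instruct checkpoint and a bitsandbytes 4-bit config; the prompt and generation parameters are placeholders, so adapt them to your setup:

```python
# Minimal sketch: load Llama 3 8B in 4-bit and call generate() directly,
# without the text-generation pipeline wrapper.
# Assumes access to meta-llama/Meta-Llama-3-8B-Instruct and the bitsandbytes library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model id

# 4-bit NF4 quantization keeps the weights small enough for a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Build the chat prompt with the model's chat template.
messages = [
    {"role": "user", "content": "Explain LoRA in one sentence."},  # placeholder prompt
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate without the pipeline; no_grad avoids keeping activation buffers around.
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
    )

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

This avoids the pipeline wrapper entirely; if memory is still tight, reducing max_new_tokens or the input length lowers the KV-cache footprint during generation.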


Thank you, sir.
