Pipeline VRAM problem with Google Colab

Hello

I am using the Llama 3 8B model in Google Colab.

I am trying to run it in an environment with 40 GB of VRAM.
I can load the model with quantization and LoRA, but the moment I load the pipeline, the VRAM is exceeded and it becomes impossible to run.
To solve this problem, should I write the generation code myself instead of loading the pipeline?
If you have any code examples, I would really appreciate it if you could post them. (This is a text-generation pipeline.)
If you have any other methods, please let me know.


Normally, the majority of VRAM is consumed when the model is loaded, but some is also consumed during inference. Also, although the reason is not well understood, there are cases where an abnormal amount of VRAM is consumed during pipeline execution. There are several code examples for running without the pipeline; below is a sketch along the lines of the official HF usage example for Llama.
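
A minimal sketch, assuming the meta-llama/Meta-Llama-3-8B-Instruct checkpoint and a bitsandbytes 4-bit config; the prompt and generation parameters are placeholders, so adapt them to your setup:

```python
# Minimal sketch: load Llama 3 8B in 4-bit and call generate() directly,
# without the text-generation pipeline wrapper.
# Assumes access to meta-llama/Meta-Llama-3-8B-Instruct and the bitsandbytes library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model id

# 4-bit NF4 quantization keeps the weights small enough for a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Build the chat prompt with the model's chat template.
messages = [
    {"role": "user", "content": "Explain LoRA in one sentence."},  # placeholder prompt
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate without the pipeline; no_grad avoids keeping activation buffers around.
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
    )

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

This avoids the pipeline wrapper entirely; if memory is still tight, reducing max_new_tokens or the input length lowers the KV-cache footprint during generation.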


Thank you, sir.
