Hello
I am using the Llama 3 8B model in Google Colab.
I am trying to run it in an environment with 40 GB of VRAM.
I can load the model with quantization and a LoRA adapter, but
the moment I create the text-generation pipeline, VRAM is exceeded and it becomes impossible to run.
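Roughly, this is what I am doing (a minimal sketch; the model ID, adapter path, and quantization settings here stand in for my actual ones):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel

model_id = "meta-llama/Meta-Llama-3-8B"

# 4-bit quantization config (settings are examples, not necessarily mine)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter ("my-lora-adapter" is a placeholder path)
model = PeftModel.from_pretrained(model, "my-lora-adapter")

# This is the step where VRAM is exceeded for me:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
```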
To solve this, should I write the generation code myself instead of loading the pipeline?
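By "writing the code myself" I mean calling model.generate() directly instead of using the pipeline wrapper, something like this sketch (reusing the model and tokenizer from above; the prompt and generation parameters are just examples):

```python
prompt = "Explain LoRA in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate directly with the already-loaded quantized model,
# skipping the pipeline wrapper entirely
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```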
If you have any working code examples, I would really appreciate it if you could post them.
If you have any other methods, please let me know.