Simple example run takes 5+ minutes on RTX 3060 - Falcon-7B

Super noob here. :slight_smile:

I'm trying to use falcon-7b-instruct locally in PyCharm on a new computer with an RTX 3060 (12 GB of GPU memory), the latest NVIDIA drivers, a decent CPU, and plenty of regular RAM. Windows 11.

But even a simple run takes more than 5 minutes, so I feel I must be doing something wrong here.
Could anyone tell me roughly what times I should expect and what I might be doing wrong?

EDIT/UPDATE: It looks to me like it's loading the model into regular RAM and using the CPU for inference. But this common example code is meant to use the GPU, no?

This is the code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = ".../falcon-7b-instruct"  # local path to the downloaded model

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
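A quick way to check whether PyTorch can see the GPU at all (if it can't, the pipeline silently falls back to the CPU):

```python
import torch

# False here means a CPU-only PyTorch build (or the driver isn't visible),
# so inference will run on the CPU no matter what the pipeline code does.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # Sanity-check which device PyTorch actually sees.
    print(torch.cuda.get_device_name(0))
```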

Also, I'm thinking there must be a way to avoid loading the model into memory on every run?
How could I load it once, maybe in a separate thread, and then access the GPU-loaded model from a separate run thread, so I can iterate faster while programming?
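For what it's worth, a loaded model only lives as long as its Python process, so it can't survive between separate runs; the usual workaround is to keep one process alive (an interactive console, `python -i`, or a notebook) and cache the expensive load inside it. A minimal sketch of the caching idea, with a hypothetical `get_pipeline` standing in for the real `transformers.pipeline(...)` call:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_pipeline():
    # Hypothetical stand-in for the expensive transformers.pipeline(...)
    # load; the real version would build and return the pipeline object.
    print("loading model (runs only once per process)")
    return object()

first = get_pipeline()   # triggers the load
second = get_pipeline()  # returns the cached object instantly
print(first is second)   # True: both names refer to the same loaded object
```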


Replying to myself: I read that you can try to force the run onto the GPU, so I tried switching the device. That did something, but I then got an error message saying I wasn't using a CUDA-enabled build of PyTorch.
So I figured out to uninstall the current 'torch' and then, using the terminal window inside PyCharm, add it back according to the info on the PyTorch website:
pip3 install torch torchvision torchaudio --index-url

I could then see that the program ran on the GPU, and it now took 20-30 seconds instead. Still not as fast as I was hoping, but maybe there are other things I can do to optimize.
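One thing worth knowing: generation time scales roughly with how many tokens you ask for, so limiting output length (e.g. with the `max_new_tokens` generation parameter instead of a large `max_length`) is the easiest lever. A small timing harness to measure this, using a hypothetical `fake_generate` stand-in so it runs without a GPU; in the real program you'd pass the transformers pipeline instead:

```python
import time

def time_generation(gen_fn, prompt, **kwargs):
    """Return (output, elapsed_seconds) for a single generation call."""
    start = time.perf_counter()
    out = gen_fn(prompt, **kwargs)
    return out, time.perf_counter() - start

# Hypothetical stand-in that simulates a fixed per-token cost.
def fake_generate(prompt, max_new_tokens=20):
    time.sleep(0.001 * max_new_tokens)
    return prompt + " ..."

_, short = time_generation(fake_generate, "Hello", max_new_tokens=10)
_, longer = time_generation(fake_generate, "Hello", max_new_tokens=100)
print(f"{short:.3f}s vs {longer:.3f}s")  # longer outputs take proportionally longer
```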

I'd say the question is solved now.
I still want to figure out how I can keep the model loaded in GPU VRAM while iterating on the program, changing the code and re-running repeatedly. That's a separate post, I guess.