Hello!
Super noob here.
I'm trying to run falcon-7b-instruct locally in PyCharm on a new computer with an RTX 3060 (12 GB of VRAM), the latest NVIDIA drivers, a decent CPU, and plenty of regular RAM. Windows 11.
But even a simple run takes more than 5 minutes, so I feel I must be doing something wrong here.
Would anyone be able to tell me roughly what times I should expect and what I might be doing wrong?
EDIT/UPDATE: It looks to me like it's loading the model into regular RAM and using the CPU for inference. But this common example code is meant to use the GPU, no?
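Here's a quick diagnostic I could run first (assuming the problem might be that my PyTorch install is a CPU-only build, in which case the pipeline would silently fall back to CPU):

```python
import torch

# If this prints False, PyTorch can't see the GPU at all (e.g. a
# CPU-only wheel was installed), which would explain multi-minute runs.
print(torch.cuda.is_available())

# None for a CPU-only build, a version string like "12.1" otherwise.
print(torch.version.cuda)

if torch.cuda.is_available():
    # Should name the RTX 3060 if the driver and CUDA build line up.
    print(torch.cuda.get_device_name(0))
```

One more thing worth noting: falcon-7b in bfloat16 needs roughly 14 GB just for the weights, so with 12 GB of VRAM, device_map="auto" may legitimately offload some layers to CPU RAM, which would also match what I'm seeing.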
This is the code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "…/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Also, I'm thinking there must be a way to avoid loading the model into memory on every run?
How could I load it once and then reuse the GPU-loaded model across runs, so I can iterate faster while programming?
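Here's a sketch of the kind of pattern I mean: load once per process, then call many times. The loader below is a cheap stand-in just to keep this runnable; in the real script its body would be the transformers.pipeline(...) call above.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_pipeline():
    """Build the expensive object exactly once per process.

    Stand-in for the real, slow step (transformers.pipeline(...)).
    lru_cache means every later call returns the same cached object.
    """
    return {"calls": 0}  # pretend this is the loaded pipeline

def generate(prompt: str) -> str:
    pipe = get_pipeline()      # first call loads; later calls reuse
    pipe["calls"] += 1
    return prompt + " [generated]"
```

Of course this only helps within one process, so the other half of the answer is keeping the process alive between edits, e.g. working in a Python console or Jupyter session instead of re-running the script from scratch each time.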
Cheers!
Fred