Poor performance from Mistral-7B-Instruct-v0.1

I’m running Mistral-7B-Instruct-v0.1 locally on a GeForce RTX 3090 Ti. It takes almost 60s to answer a basic question, and the GPU sits at 100% utilization the whole time. In contrast, I’m running mistral:instruct via Ollama on a MacBook Air M1, and it starts answering immediately with good quality. Am I doing something wrong?

I use this code to load the model.

from transformers import AutoModelForCausalLM, AutoTokenizer

access_token="hf_BOB..."
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", token=access_token)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", token=access_token)

device = "cuda"

model.to(device)
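
As far as I know, from_pretrained loads the weights in float32 unless a dtype is given, whereas Ollama serves a quantized model, so one thing I could try (untested on my side) is loading in half precision:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

access_token = "hf_BOB..."

# Assumption: the default float32 weights are part of the slowdown,
# so load the model in float16 instead.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.float16,
    token=access_token,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", token=access_token)

model.to("cuda")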

Then I run inference, which takes about 60s:

prompt = "Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?"

inputs = tokenizer([prompt], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True)

tokenizer.batch_decode(generated_ids)[0]
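
(For what it’s worth, the ~60s figure is just rough wall-clock time around the generate call, nothing rigorous:)

import time

start = time.time()
generated_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True)
print(f"generation took {time.time() - start:.1f}s")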

The answer is quite good:

Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?\n\nHere’s how we can think about this:\n\n1. We know that Bob is taller than Jane.\n2. We also know that Jane is taller than Kim.\n3. From these two statements, we can infer that Bob must be taller than Kim as well because if Jane is taller than both Bob and Kim, then Bob has to be taller than Kim.\n\nFinal answer: Yes, Bob is taller than Kim

Using Ollama I get a precise answer right away:

Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim? Based on the information given, yes, Bob is taller than Kim because Jane is taller than Kim and Bob is taller than Jane. Therefore, Bob's height exceeds Kim's height.

Is this normal? What can I do to improve speed?

If I enable streaming, the answer appears to be generated faster, and it now feels almost as fast as Ollama on my MacBook Air. Still, I’d have expected better speed from the NVIDIA GPU.

from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

prompt = "Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?"

inputs = tokenizer([prompt], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
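
To check whether streaming actually changes throughput (rather than just the perceived latency), I could also divide the number of new tokens by the elapsed time, something like this sketch:

import time
from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

start = time.time()
generated_ids = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
elapsed = time.time() - start

# generated_ids includes the prompt tokens, so subtract them to count only new ones
new_tokens = generated_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s ({new_tokens / elapsed:.1f} tok/s)")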