Poor performance from Mistral-7B-Instruct-v0.1

I’m running Mistral-7B-Instruct-v0.1 locally on a GeForce RTX 3090 Ti. It takes almost 60s to answer a basic question, and the GPU sits at 100% utilization the whole time. In contrast, I’m running mistral:instruct via Ollama on a MacBook Air M1, and it starts answering immediately with good quality. Am I doing something wrong?

I use this code to load the model.

from transformers import AutoModelForCausalLM, AutoTokenizer

access_token="hf_BOB..."
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", token=access_token)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", token=access_token)

device = "cuda"

model.to(device)
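
As far as I know, from_pretrained loads the weights in float32 unless a dtype is given, whereas Ollama serves a quantized model, so one thing I could try (untested on my side) is loading in half precision:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

access_token = "hf_BOB..."

# Assumption: the default float32 weights are part of the slowdown,
# so load the model in float16 instead.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.float16,
    token=access_token,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", token=access_token)

model.to("cuda")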

Then I run inference, which takes about 60s:

prompt = "Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?"

inputs = tokenizer([prompt], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True)

tokenizer.batch_decode(generated_ids)[0]
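
(For what it’s worth, the ~60s figure is just rough wall-clock time around the generate call, nothing rigorous:)

import time

start = time.time()
generated_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True)
print(f"generation took {time.time() - start:.1f}s")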

The answer is quite good:

Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?\n\nHere’s how we can think about this:\n\n1. We know that Bob is taller than Jane.\n2. We also know that Jane is taller than Kim.\n3. From these two statements, we can infer that Bob must be taller than Kim as well because if Jane is taller than both Bob and Kim, then Bob has to be taller than Kim.\n\nFinal answer: Yes, Bob is taller than Kim

Using Ollama I get a precise answer right away:

Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim? Based on the information given, yes, Bob is taller than Kim because Jane is taller than Kim and Bob is taller than Jane. Therefore, Bob's height exceeds Kim's height.

Is this normal? What can I do to improve speed?

If I enable streaming, the answer appears to be generated faster, and it now feels almost as fast as Ollama on my MacBook Air. Still, I’d have expected better speed from the NVIDIA GPU.

from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

prompt = "Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?"

inputs = tokenizer([prompt], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
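
To check whether streaming actually changes throughput (rather than just the perceived latency), I could also divide the number of new tokens by the elapsed time, something like this sketch:

import time
from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

start = time.time()
generated_ids = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
elapsed = time.time() - start

# generated_ids includes the prompt tokens, so subtract them to count only new ones
new_tokens = generated_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s ({new_tokens / elapsed:.1f} tok/s)")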