Inference for a 7B model on A100 takes too long?

thangphan68 · November 28, 2023, 3:36am

Hello, this is my first time trying out Hugging Face with a model this big. Is 2-3 seconds for a single forward pass expected, or is that too long? I'm running the model on 2 A100 GPUs. Here is my code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time


model_path = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, cache_dir='/checkpoints')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir='/checkpoints'
)

text = ...
inputs = tokenizer([text], return_tensors="pt").to('cuda')

start = time.perf_counter()
output = model(**inputs)
end = time.perf_counter()
print(end - start)

Output:

2.761020613834262

RomanEngeler1805 · March 15, 2024, 6:05pm

It depends on how many tokens you generate. On a single A100, I observe an inference speed of around 23 tokens/second with Mistral 7B in FP32.
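To compare numbers like these, it helps to measure throughput the same way each time. Below is a minimal sketch of a timing helper, not a definitive benchmark. The names `measure_tokens_per_second`, `generate_fn`, and `sync_fn` are hypothetical and chosen only for illustration. It assumes two things: the first call is usually slower because of one-time CUDA initialization, so it is run as a warm-up and excluded; and CUDA kernels launch asynchronously, so the GPU should be synchronized before reading the clock.

```python
import time
from typing import Callable, Optional


def measure_tokens_per_second(
    generate_fn: Callable[[], int],
    sync_fn: Optional[Callable[[], None]] = None,
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Return average generated tokens per second.

    generate_fn runs one generation and returns the number of NEW tokens.
    sync_fn (optional) blocks until pending GPU work is done,
    e.g. torch.cuda.synchronize.
    """
    # Warm-up runs are excluded from the timing.
    for _ in range(warmup):
        generate_fn()
    if sync_fn is not None:
        sync_fn()

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        total_tokens += generate_fn()
    # Wait for queued GPU work before stopping the clock.
    if sync_fn is not None:
        sync_fn()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


# Possible wiring with the model and inputs from the original post
# (left as comments because it needs the 7B checkpoint and a GPU):
#
# def run_once() -> int:
#     with torch.no_grad():
#         out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
#     return out.shape[1] - inputs["input_ids"].shape[1]
#
# print(measure_tokens_per_second(run_once, sync_fn=torch.cuda.synchronize))
```

The value 64 for `max_new_tokens` is an arbitrary choice for the sketch. The same warm-up and synchronize pattern applies if you time a single `model(**inputs)` forward pass instead of `generate`.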
