Inference for a 7B model on A100 takes too long?

Hello, this is my first time trying out Hugging Face with a model this big. I'm running the model on 2 A100 GPUs, and a single forward pass takes 2-3 seconds. Is that expected, or is something wrong? Here is my code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time


model_path = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, cache_dir='/checkpoints')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",           # shard the model across the available GPUs
    torch_dtype=torch.bfloat16,  # load the weights in bfloat16
    cache_dir='/checkpoints'
)

text = ...
inputs = tokenizer([text], return_tensors="pt").to('cuda')

start = time.perf_counter()
output = model(**inputs)
end = time.perf_counter()
print(end - start)

Output"

2.761020613834262
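
Side note on the measurement itself: CUDA kernels are launched asynchronously, so a wall-clock timing like the one above can also pick up one-off startup costs and may not reflect steady-state speed. A minimal sketch of a more careful measurement, reusing the model and inputs defined above (the warmup count of 3 is an arbitrary choice):

import time
import torch

# warmup passes: the first forward call pays one-time CUDA/initialization overhead
with torch.no_grad():
    for _ in range(3):
        model(**inputs)

torch.cuda.synchronize()        # wait for all queued GPU work to finish
start = time.perf_counter()
with torch.no_grad():
    output = model(**inputs)
torch.cuda.synchronize()        # make sure the forward pass has actually completed
end = time.perf_counter()
print(end - start)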

That depends on how many tokens you generate. For reference, with a single A100 I observe around 23 tokens / second with Mistral 7B in FP32.
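
To measure throughput in those units, a sketch like the one below could work (assuming the same model and tokenizer as in the question; the prompt, max_new_tokens=128, and greedy decoding are arbitrary choices):

import time
import torch

gen_inputs = tokenizer(["The quick brown fox"], return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    generated = model.generate(**gen_inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# count only the newly generated tokens, not the prompt
new_tokens = generated.shape[1] - gen_inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens / second")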