Hello, this is my first time trying out Hugging Face with a model this big. I'm running the model on 2 A100 GPUs. Is 2-3 seconds for a forward pass too long, or is that expected? Here is my code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
model_path = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, cache_dir='/checkpoints')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir='/checkpoints',
)
text = ...
inputs = tokenizer([text], return_tensors="pt").to('cuda')
start = time.perf_counter()
output = model(**inputs)
end = time.perf_counter()
print(end - start)
Output:
2.761020613834262
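One thing to check with a measurement like this: it times a single call, and it is the first call after loading. CUDA kernels launch asynchronously, and the first forward pass usually carries one-time setup cost, so the number above may not reflect steady-state latency. Below is a minimal sketch of a fairer measurement. `time_call` is a hypothetical helper name, not part of transformers or PyTorch, and the warm-up and iteration counts are arbitrary assumptions.

```python
import time
from statistics import median


def time_call(fn, warmup=2, iters=5, sync=None):
    """Return the median wall-clock seconds of fn() over `iters` timed runs.

    `sync` is an optional zero-argument callable that blocks until pending
    work has finished (for example torch.cuda.synchronize), so that
    asynchronously queued GPU work is included in the measurement.
    """
    # Untimed warm-up runs absorb one-time setup costs.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for pending work before stopping the clock
        timings.append(time.perf_counter() - start)
    return median(timings)


# Hypothetical usage with the model and inputs from the post (not run here):
#   import torch
#   def sync_all():
#       # The model is spread over two GPUs, so wait on every device.
#       for i in range(torch.cuda.device_count()):
#           torch.cuda.synchronize(i)
#   with torch.no_grad():  # inference only, so skip autograd bookkeeping
#       print(time_call(lambda: model(**inputs), sync=sync_all))
```

If the median after warm-up is still in the 2-3 second range, the slowdown is probably not just first-call overhead, and the input length and how `device_map="auto"` placed the layers would be the next things to look at.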