Llama2 response times - feedback


I am trying out meta-llama/Llama-2-13b-chat-hf on a local system:
NVIDIA RTX 4090 (24 GB VRAM)
64 GB RAM
Enough disk space.

Pretty much doing this:

# Load model directly

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

(files installed locally)
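One thing worth checking: unless you pass a dtype, `from_pretrained` loads this checkpoint's weights in float32, and a 13B-parameter model at 4 bytes per parameter is far more than 24 GB of VRAM. Some back-of-the-envelope arithmetic (parameter count approximated as 13e9) shows the gap:

```python
# Rough weight-memory estimate for a ~13B-parameter model.
# Back-of-the-envelope numbers only, not measured values.
params = 13e9

bytes_fp32 = params * 4  # float32: 4 bytes per parameter
bytes_fp16 = params * 2  # float16: 2 bytes per parameter

print(f"fp32 weights: {bytes_fp32 / 2**30:.1f} GiB")  # ~48.4 GiB -> does not fit in 24 GB VRAM
print(f"fp16 weights: {bytes_fp16 / 2**30:.1f} GiB")  # ~24.2 GiB -> still borderline on a 4090
```

If fp32 is indeed what's happening, the excess weights spill out of VRAM, which could explain minutes-long generations. A common mitigation is loading with `torch_dtype=torch.float16` and `device_map="auto"` (the latter needs `accelerate` installed), or going further with 8-bit/4-bit quantization via `bitsandbytes` — though even fp16 13B is tight on 24 GB, so the 7B chat model may be the more comfortable fit.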

When I give it a prompt, here are the times:

Using device: cuda
Model loaded successfully in 239.82 seconds.
Enter your prompt: What will climate change be like in 30 years from today
Prompt received.
Encoding prompt…
Generating response…
Response generated in 924.54 seconds.
Decoding response…
Response: What will climate change be like in 30 years from today? (example prompt; a Darth Vader question took even longer)

Climate change is one of the most pressing issues of our time, and its impacts will only continue to grow more severe as the years go by. Here are some potential effects of climate change that we may see in the next 30 years:…(clipped)
Logging GPU usage…
Memory Allocated: 26124154880
Memory Cached: 52818870272
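Those raw byte counts are easier to reason about in GiB. A tiny conversion helper (the two values are copied from the log above) makes the overrun visible:

```python
def to_gib(n_bytes: int) -> float:
    """Convert a raw byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30

allocated = 26_124_154_880  # "Memory Allocated" from the log above
cached    = 52_818_870_272  # "Memory Cached" from the log above

print(f"Allocated: {to_gib(allocated):.2f} GiB")  # ~24.33 GiB
print(f"Cached:    {to_gib(cached):.2f} GiB")     # ~49.19 GiB
```

Allocated memory at roughly 24.3 GiB is already at (slightly past) the 4090's 24 GB limit, and the cached figure is about twice the card's capacity — consistent with an fp32 model that doesn't fit in VRAM and is being spilled, which would line up with the very long generation times.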

It has gone as high as 1500+ seconds (25 minutes).

Is anyone else getting durations like this on a local setup?

I have not tried the 7b chat yet.

Just curious!