Llama2 response times - feedback


I am trying out meta-llama/Llama-2-13b-chat-hf on a local system:
NVIDIA RTX 4090 (24 GB VRAM)
64 GB RAM
Enough disk space.

Pretty much doing this:

# Load model directly

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

(files installed locally)
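One thing worth checking: unless you pass a dtype, `from_pretrained` loads this checkpoint's weights in float32, and a 13B-parameter model at 4 bytes per parameter is far more than 24 GB of VRAM. Some back-of-the-envelope arithmetic (parameter count approximated as 13e9) shows the gap:

```python
# Rough weight-memory estimate for a ~13B-parameter model.
# Back-of-the-envelope numbers only, not measured values.
params = 13e9

bytes_fp32 = params * 4  # float32: 4 bytes per parameter
bytes_fp16 = params * 2  # float16: 2 bytes per parameter

print(f"fp32 weights: {bytes_fp32 / 2**30:.1f} GiB")  # ~48.4 GiB -> does not fit in 24 GB VRAM
print(f"fp16 weights: {bytes_fp16 / 2**30:.1f} GiB")  # ~24.2 GiB -> still borderline on a 4090
```

If fp32 is indeed what's happening, the excess weights spill out of VRAM, which could explain minutes-long generations. A common mitigation is loading with `torch_dtype=torch.float16` and `device_map="auto"` (the latter needs `accelerate` installed), or going further with 8-bit/4-bit quantization via `bitsandbytes` — though even fp16 13B is tight on 24 GB, so the 7B chat model may be the more comfortable fit.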

When I give it a prompt, here are the times:

Using device: cuda
Model loaded successfully in 239.82 seconds.
Enter your prompt: What will climate change be like in 30 years from today
Prompt received.
Encoding prompt…
Generating response…
Response generated in 924.54 seconds.
Decoding response…
Response: What will climate change be like in 30 years from today? (example prompt; a Darth Vader question took even longer)

Climate change is one of the most pressing issues of our time, and its impacts will only continue to grow more severe as the years go by. Here are some potential effects of climate change that we may see in the next 30 years:…(clipped)
Logging GPU usage…
Memory Allocated: 26124154880
Memory Cached: 52818870272
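Those raw byte counts are easier to reason about in GiB. A tiny conversion helper (the two values are copied from the log above) makes the overrun visible:

```python
def to_gib(n_bytes: int) -> float:
    """Convert a raw byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30

allocated = 26_124_154_880  # "Memory Allocated" from the log above
cached    = 52_818_870_272  # "Memory Cached" from the log above

print(f"Allocated: {to_gib(allocated):.2f} GiB")  # ~24.33 GiB
print(f"Cached:    {to_gib(cached):.2f} GiB")     # ~49.19 GiB
```

Allocated memory at roughly 24.3 GiB is already at (slightly past) the 4090's 24 GB limit, and the cached figure is about twice the card's capacity — consistent with an fp32 model that doesn't fit in VRAM and is being spilled, which would line up with the very long generation times.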

It has gone as high as 1500+ seconds (25 minutes).

Is anyone else getting durations like this on a local setup?

I have not tried the 7b chat yet.

Just curious!