Hello everyone, I am trying to use Llama-2 (7B) from Hugging Face. With the code below I was able to load the model successfully, but when I try to generate output it takes forever.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("Llama-2-7b-hf")

input_ids = tokenizer.encode("What is LLM?", return_tensors="pt")
output = model.generate(
    input_ids,
    do_sample=False,  # greedy decoding (temperature=0 is not a valid sampling setting)
    max_new_tokens=100,
)
generated_text = tokenizer.decode(output[0])  # decode the first (only) sequence
print(generated_text)
```
Model files were downloaded from the Llama-2-7b-hf repository.
Hardware: MacBook Pro (M2 Pro), 16 GB RAM
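One likely cause of the slowness is that `from_pretrained` defaults to float32 on CPU, so a 7B model needs ~28 GB and swaps heavily on a 16 GB machine. A minimal sketch of loading in float16 and moving the model to Apple's Metal (MPS) backend when available (this assumes a PyTorch build with MPS support; the model name follows the question):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def pick_device() -> str:
    # Prefer the Apple-Silicon GPU (MPS backend) when PyTorch supports it,
    # otherwise fall back to CPU.
    return "mps" if torch.backends.mps.is_available() else "cpu"


def load_llama(name: str = "Llama-2-7b-hf"):
    device = pick_device()
    tokenizer = AutoTokenizer.from_pretrained(name)
    # float16 roughly halves memory (~28 GB fp32 -> ~14 GB fp16 for 7B),
    # which matters a lot on a 16 GB machine.
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16
    ).to(device)
    return tokenizer, model, device
```

When generating, remember to move the input tensor to the same device, e.g. `input_ids.to(device)`, or `generate` will raise a device-mismatch error.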