I have just switched from LLaMA 1 to Llama 2 (same hardware), and generating text now takes 10x longer. I'm guessing something in my code is making it this much slower. Can anyone point out any mistakes in the code below?
from transformers import LlamaForCausalLM, LlamaTokenizerFast


def llama_textgen(prompt, model, tokenizer, max_tokens=4):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=max_tokens,
        do_sample=False,  # greedy decoding
    )
    # Decode only the newly generated tokens, dropping the prompt portion.
    text_outputs = tokenizer.batch_decode(
        outputs[:, inputs.input_ids.shape[-1]:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )[0]
    return text_outputs


model_base_name = "meta-llama/Llama-2-70b-chat-hf"
model = LlamaForCausalLM.from_pretrained(
    model_base_name,
    cache_dir="/llama2_chat",
)
tokenizer = LlamaTokenizerFast.from_pretrained(model_base_name)

prompt = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n'
response = llama_textgen(
    prompt,
    model=model,
    tokenizer=tokenizer,
)