Llama 3 performance is 4 mins. can get it in seconds?

Magesh78 · July 4, 2024, 8:03am

Hello Everyone,

Model - llama_3_8B_Instruct
Hardware - VM with 32 GB RAM and one Tesla T4 GPU (16 GB VRAM)

I want to extract structured information in JSON format from email input data. I used the Transformers library to run the model, with max_new_tokens set to 300 and temperature set to 0.01 for output generation. Even excluding pre- and post-processing of the input data, the model.generate() call takes a long time to execute, so my overall response time is around 4 minutes.

GPU utilization shows 80 percent when running the model.

Do you think that is a reasonable processing time?
Is it possible to reduce the output generation time to under 1 minute, or even to a few seconds (without a GPU upgrade)?
Kindly share your thoughts.

Thanks.

sriram6399 · March 23, 2025, 10:05pm

Experiencing a similar issue. Any suggestions or fixes are greatly appreciated.

John6666 · March 24, 2025, 3:50am

16GB of VRAM is not enough to hold the 8B model at 16-bit precision. Part of it will be offloaded to regular RAM, which makes generation very slow. We recommend using quantization as follows, or using a program like Ollama, which runs quantized models by default.
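As a rough back-of-the-envelope check on why 16GB falls short, the sketch below estimates the memory needed for the weights alone. The parameter count (about 8.03 billion) is an approximation, the helper name weight_memory_gib is made up for illustration, and the estimate ignores the KV cache, activations, CUDA overhead, and the small extra cost of quantization constants, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: Llama 3 8B has roughly 8.03 billion parameters.
N_PARAMS = 8.03e9

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # 16-bit weights (fp16/bf16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit weights (e.g. NF4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")
```

At 16 bits the weights alone come to about 15 GiB, which leaves essentially nothing on a 16GB card for the KV cache and activations. At 4 bits they drop to under 4 GiB, so the whole model can stay on the GPU.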

Quantization with Transformers

pip install bitsandbytes accelerate

from transformers import pipeline, BitsAndBytesConfig
import torch

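# 4-bit NF4 quantization with double quantization to shrink the weights further
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # The Tesla T4 has no native bfloat16 support, so compute in float16 there;
    # torch.bfloat16 is fine on Ampere or newer GPUs.
    bnb_4bit_compute_dtype=torch.float16,
)
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Loading options such as quantization_config are passed through model_kwargs
pipe = pipeline("text-generation", model=model_id, model_kwargs={"quantization_config": nf4_config})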
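
Ollama

For the Ollama route, here is a minimal sketch of calling a locally running Ollama server over its HTTP API and asking for JSON output. It assumes Ollama is installed and listening on its default port (11434) and that a model has already been pulled under the tag llama3. The prompt wording and the helper names build_request and extract_fields are made up for illustration, so adjust them to your own email schema.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(email_text: str, model: str = "llama3") -> dict:
    """Build the request body for Ollama's /api/generate endpoint."""
    prompt = (
        "Extract the sender, the subject and any dates from the email below. "
        "Respond with a single JSON object only.\n\n" + email_text
    )
    return {
        "model": model,      # assumed tag; use whatever you pulled
        "prompt": prompt,
        "stream": False,     # return one complete response instead of chunks
        "format": "json",    # ask Ollama to constrain the output to valid JSON
        "options": {"temperature": 0, "num_predict": 300},
    }


def extract_fields(email_text: str, model: str = "llama3") -> dict:
    """Send the email to the local Ollama server and parse the JSON reply."""
    payload = json.dumps(build_request(email_text, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    # The generated text is returned in the "response" field as a string.
    return json.loads(body["response"])


# Example call (needs a running Ollama server):
# fields = extract_fields("From: alice@example.com\nSubject: Meeting on Friday")
```

Here num_predict plays the same role as max_new_tokens in Transformers. Keeping it as small as your JSON output allows also helps, since generation time grows with the number of tokens produced.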
