Hello Everyone,
Model - llama_3_8B_Instruct
Hardware - VM with 32 GB RAM and one 16 GB GPU (Tesla T4)
I want to extract structured information in JSON format from email input data. I used the Transformers library to run the model, with max_new_tokens set to 300 and temperature set to 0.01 for output generation. Apart from pre- and post-processing of the input data, the model.generate() call takes a long time to execute, so my overall response time is around 4 minutes.
GPU utilization shows 80 percent when running the model.
Do you guys think that's a reasonable processing time?
Is it possible to reduce the output generation time to under 1 minute, or even a few seconds (without a GPU upgrade)?
Kindly share your thoughts on it.
Thanks.
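For reference, the post-processing step that pulls the JSON out of the model's raw text looks roughly like this (a minimal sketch; `extract_json` is just an illustrative helper I wrote, not from any library):

```python
import json
import re

def extract_json(model_output: str):
    """Pull the first JSON object out of the model's raw text output."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# The model often wraps the JSON in prose, e.g.:
raw = 'Here is the extracted data: {"sender": "alice@example.com", "subject": "Invoice"}'
print(extract_json(raw))
```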
Experiencing a similar issue. Any suggestions or fixes are greatly appreciated.
If you have 16 GB of VRAM, it won't be enough for the 8B model at full precision: some of the weights will be offloaded to regular RAM, and generation will be very slow. We recommend using quantization as shown below, or using a program like Ollama, which serves quantized models by default.
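A quick back-of-envelope calculation shows why (weights only; the KV cache and activations need additional memory on top of this):

```python
# Approximate VRAM needed just to hold the weights of an 8B-parameter model
params = 8e9
print(f"fp16: {params * 2 / 1e9:.0f} GB")    # 2 bytes/param -> already fills a 16 GB T4
print(f"int8: {params * 1 / 1e9:.0f} GB")    # 1 byte/param
print(f"nf4:  {params * 0.5 / 1e9:.0f} GB")  # 0.5 bytes/param -> leaves room for KV cache
```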
Quantization with Transformers
pip install bitsandbytes

from transformers import pipeline, BitsAndBytesConfig
import torch

# NF4 4-bit quantization; the Tesla T4 has no bfloat16 support, so use float16
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_compute_dtype=torch.float16)
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# quantization_config is a model-loading argument, so pass it via model_kwargs
pipe = pipeline("text-generation", model=model_id, device_map="auto",
                model_kwargs={"quantization_config": nf4_config})
Ollama
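With Ollama, pulling and running a quantized build is a one-liner (assuming Ollama is already installed; the default `llama3:8b` tag is a 4-bit quantized build):

```shell
# Pull and run the quantized Llama 3 8B Instruct model
ollama pull llama3:8b
ollama run llama3:8b

# Or call the local HTTP API (Ollama listens on port 11434 by default);
# "format": "json" constrains the output to valid JSON
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Extract the sender and subject from this email as JSON: ...",
  "format": "json",
  "stream": false
}'
```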