Hello Everyone,
Model - llama_3_8B_Instruct
Hardware - VM with 32 GB RAM and one Tesla T4 GPU (16 GB)
I want to extract structured information in JSON format from email input data. I used the transformers library to run the model, with max_new_tokens set to 300 and temperature set to 0.01 for output generation. Leaving aside pre- and post-processing of the input data, the model.generate() call takes a long time to execute, so my overall response time is around 4 minutes.
GPU utilization is at about 80 percent while the model is running.
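To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of the implied generation speed. It assumes that all 300 new tokens are actually generated and that nearly all of the 4 minutes is spent in model.generate(), neither of which has been measured precisely, so treat the result as an estimate only.

```python
# Rough throughput estimate from the figures in this post.
# Assumptions: all 300 tokens are generated, and the full 4 minutes
# is attributed to generation (pre/post-processing ignored).
max_new_tokens = 300
response_seconds = 4 * 60


def tokens_per_second(tokens, seconds):
    """Average generation rate in tokens per second."""
    return tokens / seconds


current_rate = tokens_per_second(max_new_tokens, response_seconds)
seconds_per_token = response_seconds / max_new_tokens

# Rate that would be needed to finish the same output in 1 minute.
target_rate = tokens_per_second(max_new_tokens, 60)

print(f"current: {current_rate:.2f} tokens/s ({seconds_per_token:.2f} s per token)")
print(f"needed for 1 min: {target_rate:.2f} tokens/s")
print(f"required speed-up: {target_rate / current_rate:.1f}x")
```

Under these assumptions, getting below 1 minute means roughly a 4x speed-up in generation.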
Do you think this is a reasonable processing time?
Is it possible to reduce the output generation time to under 1 minute, or even to a few seconds, without a GPU upgrade?
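One thing that may be relevant to this question is how much GPU memory the model weights alone need at different precisions. The sketch below is only arithmetic: it assumes a nominal 8 billion parameters and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The precision in which the model is actually loaded is not stated here, so the table is illustrative rather than a diagnosis.

```python
# Approximate weight memory for a nominal 8B-parameter model.
# Assumptions: exactly 8e9 parameters; weights only (no KV cache,
# activations, or framework overhead).
N_PARAMS = 8e9
GPU_MEMORY_GB = 16  # Tesla T4, as listed in this post

BYTES_PER_PARAM = {
    "float32": 4,
    "float16": 2,
    "int8": 1,
    "int4": 0.5,
}


def weight_memory_gb(n_params, bytes_per_param):
    """Weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for dtype, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(N_PARAMS, nbytes)
    # Strictly less than the card size, so some headroom remains.
    leaves_headroom = gb < GPU_MEMORY_GB
    print(f"{dtype:8s} {gb:5.1f} GB  leaves headroom on GPU: {leaves_headroom}")
```

If the weights do not fit on the GPU with headroom to spare, part of the model may end up in CPU RAM, which would be one possible explanation for slow generation.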
Kindly share your thoughts on this.
Thanks.