I’m building a RAG pipeline, but I’m running into issues with response quality and speed. When I use the Mistral AI LLM, the answers aren’t very accurate. Switching to the LLaMA model yields valid answers, but responses are slow, taking around 7-10 seconds. The LLaMA model is also quite large and hard to load. Could anyone suggest ways to improve both accuracy and performance, while also addressing the loading time of large models?
Hello. There are many variants of both Mistral and Llama, so it’s hard to say much without knowing which one you’re using, but quantization is useful for loading large models in relatively little memory.
You might be worried that quantization hurts accuracy, but in practice the difference is usually small. If the original model is good, a quantized version should still give you good results.
The exact method depends on the software stack you’re using.
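For example, if your stack is Hugging Face transformers, a minimal sketch of 4-bit loading via bitsandbytes might look like the following (the model ID is just an example; substitute whichever Mistral/Llama variant you actually use):

```python
# Minimal sketch: 4-bit quantized loading with transformers + bitsandbytes.
# The model ID below is only an example; use your own variant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # do the math in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs/CPU
)
```

A 7B model loaded this way needs roughly 4-5 GB of GPU memory instead of ~14 GB in fp16.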
The hard part is probably speeding it up. Roughly, your options are: get faster hardware, train a smaller and faster model to be smarter, or use trickier techniques such as parallel processing (see the batching sketch below).
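One concrete form of that last option: if your app handles several queries at once, you can batch them through a single generate() call, which improves throughput (though not the latency of any single query). A rough sketch, reusing the tokenizer and model from the snippet above:

```python
# Rough sketch: batch several prompts through one generate() call.
# Assumes the `tokenizer` and `model` objects from the previous snippet.
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad for decoder-only generation

prompts = ["First question ...", "Second question ...", "Third question ..."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```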
Training a small model to be smarter requires a lot of know-how and at least a little money for compute.
It’s often cheaper to use someone else’s already-trained model; they’re available in abundance on Hugging Face (HF).
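For instance, swapping in a small, already-trained chat model from the Hub can cut both load time and response time. A sketch (TinyLlama here is just one example; pick whatever scores well for your task):

```python
# Sketch: run a small pre-trained chat model from the HF Hub.
# TinyLlama is an example; any small instruct/chat model works the same way.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,
    device_map="auto",
)
result = generator("Answer using the retrieved context: ...", max_new_tokens=128)
print(result[0]["generated_text"])
```

A 1B-scale model will generally run several times faster than a 7B one, at the cost of some reasoning ability, so it’s worth benchmarking on your own RAG queries before committing.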