I’m building a RAG pipeline, but I’m running into issues with response quality and speed. When I use the Mistral AI LLM, the answers aren’t very accurate. Switching to the LLaMA model yields valid answers, but responses are slow, taking around 7-10 seconds. The LLaMA model is also quite large and hard to load. Could anyone suggest ways to improve both accuracy and performance, while also addressing the loading time of large models?
Hello. There are many variants of both Mistral and Llama, so it’s hard to say much without knowing which one you’re using, but quantization is useful for loading large models in relatively little memory.
You might be worried that quantization hurts accuracy, but in practice the difference is usually small. If the original model is good, a quantized version should still give you good results.
The exact method depends on the software stack you’re using.
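For example, if your stack is Hugging Face transformers, a minimal sketch of 4-bit loading via bitsandbytes might look like the following (the model ID is just an example; substitute whichever Mistral/Llama variant you actually use):

```python
# Minimal sketch: 4-bit quantized loading with transformers + bitsandbytes.
# The model ID below is only an example; use your own variant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # do the math in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs/CPU
)
```

A 7B model loaded this way needs roughly 4-5 GB of GPU memory instead of ~14 GB in fp16.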
The hard part is probably speeding it up. Roughly, your options are: get faster hardware, train a smaller and faster model to be smarter, or use trickier techniques such as parallel processing (see the batching sketch below).
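One concrete form of that last option: if your app handles several queries at once, you can batch them through a single generate() call, which improves throughput (though not the latency of any single query). A rough sketch, reusing the tokenizer and model from the snippet above:

```python
# Rough sketch: batch several prompts through one generate() call.
# Assumes the `tokenizer` and `model` objects from the previous snippet.
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad for decoder-only generation

prompts = ["First question ...", "Second question ...", "Third question ..."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```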
Training a small model to be smarter requires a lot of know-how and at least a little money for compute.
It’s often cheaper to use someone else’s already-trained model; they’re available in abundance on Hugging Face (HF).
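For instance, swapping in a small, already-trained chat model from the Hub can cut both load time and response time. A sketch (TinyLlama here is just one example; pick whatever scores well for your task):

```python
# Sketch: run a small pre-trained chat model from the HF Hub.
# TinyLlama is an example; any small instruct/chat model works the same way.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,
    device_map="auto",
)
result = generator("Answer using the retrieved context: ...", max_new_tokens=128)
print(result[0]["generated_text"])
```

A 1B-scale model will generally run several times faster than a 7B one, at the cost of some reasoning ability, so it’s worth benchmarking on your own RAG queries before committing.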