Yeah, yesterday I was working with Ollama, and Ollama runs a **single inference session per model instance.** When multiple requests hit that same instance, Ollama queues them and processes them one at a time; there's no parallel token generation inside one model. That's the drawback. So I was thinking of running the model locally using libraries like Transformers, vLLM, or Text Generation Inference (TGI).
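Something like this rough vLLM sketch is what I had in mind. It's just a minimal example assuming vLLM is installed and the GPU fits the model; the model id and sampling settings are placeholders, not recommendations:

```python
# Minimal sketch: vLLM schedules these prompts together (continuous batching)
# and generates tokens for them concurrently, instead of queuing them
# strictly one after another the way a single Ollama session does.
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "Summarize the benefits of running models locally.",
    "Write a haiku about GPUs.",
]

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model id
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.outputs[0].text)
```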
Oh. When handling data with long context lengths, TGI and vLLM are reliable and fast. Quantization is also well supported in both.
TGI is particularly good for load balancing.
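For example, once a TGI server is running locally (the http://localhost:8080 endpoint here is an assumption, not a fixed address), its router batches concurrent requests for you. A minimal client-side sketch:

```python
# Minimal sketch: fire several requests at a locally running TGI server.
# TGI's router batches in-flight requests together rather than serving
# them strictly one by one.
from concurrent.futures import ThreadPoolExecutor

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed local TGI endpoint

prompts = [
    "Give one advantage of quantized models.",
    "What does continuous batching mean?",
    "Name a use case for long-context models.",
]

def ask(prompt: str) -> str:
    # text_generation sends the prompt to the TGI server and returns the text
    return client.text_generation(prompt, max_new_tokens=64)

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```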
I agree that focusing on open-source, local AI improves privacy. If you don't know this repo already, I find it useful and inspiring: LM Studio · GitHub
TGI sounds like a good fit for your app, don't you think?