Yeah, yesterday I was working with Ollama, and Ollama runs a **single inference session per model instance.** When multiple requests hit that same instance, Ollama queues them and processes them one at a time — there’s no parallel token generation inside one model. So that’s the drawback of it. That’s why I was thinking of running the model locally with libraries like Transformers, vLLM, or Text Generation Inference (TGI) instead; see the sketch below.
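For example, here’s a minimal vLLM sketch of batched generation (assuming vLLM is installed and the model fits in local VRAM; the model name is just a placeholder, not one mentioned in this thread):

```python
# Minimal sketch: vLLM batches these prompts and generates tokens for them
# concurrently (continuous batching), instead of queuing one request at a time.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize what continuous batching does.",
    "Explain paged attention in one paragraph.",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```

TGI exposes a similar capability as an HTTP server rather than a Python API, so the right choice mostly depends on whether you want in-process generation or a standalone serving endpoint.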