In Ollama, you can bake a custom system prompt into a model with a Modelfile, or you can use an existing model and pass the instructions as part of each prompt. Which one is faster in terms of token processing? And does a system prompt retain the processed tokens between sessions?
In my testing, it does seem to get faster across runs.
Answered by Hugging Chat: https://huggingface.co/chat/
In Ollama, defining the system prompt in a Modelfile can be faster per request, because the prompt is fixed at model creation and loaded with the model, so clients don't have to resend and reprocess it on every call. The prompt also persists between sessions, since it's part of the model definition. The one-time `ollama create` step adds some upfront setup, though. Whether already-processed tokens are retained is a separate question: that depends on Ollama's configuration, in particular whether it runs as a persistent service that keeps the model (and its cache) in memory.
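For concreteness, here is a minimal sketch of the Modelfile approach. The base model name (`llama3`) and the prompt text are illustrative placeholders, not part of the answer above:

```
FROM llama3
SYSTEM """You are a concise assistant. Answer in short bullet points."""
```

Build it once with `ollama create my-assistant -f Modelfile`, then every `ollama run my-assistant` session starts with that system prompt already in place, with nothing extra to send per request.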
Answer
- Speed: Baking the system prompt into the Modelfile is generally faster, because the prompt is processed when the model is set up rather than being resent and reprocessed with every new request [1][2] (a per-request sketch follows this list).
- Token Retention: The system prompt itself is retained between sessions because it's part of the model's persistent definition. Already-processed tokens, however, are not retained automatically: Ollama must be configured to keep the model loaded, and the conversation context has to be carried forward explicitly [1][2][3] (see the context sketch after this list).
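For comparison, the alternative is to pass the system prompt with every request. A minimal sketch using the `ollama` Python client; the model name, prompt text, and `keep_alive` value are assumptions for illustration:

```python
import ollama

# Without a Modelfile, the system prompt is part of every request's
# payload, so its tokens count toward each request's prompt processing.
# keep_alive asks the server to keep the model loaded between requests,
# which avoids repaying the model-load cost (and lets the server reuse
# its prompt cache where possible).
response = ollama.chat(
    model="llama3",  # assumed base model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Is a Modelfile system prompt faster?"},
    ],
    keep_alive="10m",  # keep the model in memory for 10 minutes
)
print(response["message"]["content"])
```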
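On the retention question, the raw generate endpoint makes the processed tokens visible: its response includes a `context` value that can be passed back into the next call so earlier turns aren't reprocessed. Nothing is retained unless the caller round-trips it. A sketch under the same assumptions:

```python
import ollama

# First call: the prompt is evaluated; the response carries `context`,
# an encoding of the processed conversation so far.
first = ollama.generate(model="llama3", prompt="My name is Alice.")

# Second call: passing `context` back continues from the already-
# processed tokens instead of starting from scratch.
second = ollama.generate(
    model="llama3",
    prompt="What is my name?",
    context=first["context"],
)
print(second["response"])
```

In short: `keep_alive` controls whether the model stays in memory, while carrying `context` forward is what actually preserves processed tokens across calls.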
References
[1] Defining the system prompt in a Modelfile removes the need to resend and reprocess it on each request.
[2] Running Ollama as a persistent service keeps the model loaded, which affects whether context is retained.
[3] The Modelfile's SYSTEM instruction bakes the prompt into the model, so it is loaded with the model.