Hi,
I have used Ollama to pull an HF GGUF model for local usage, following this post: Use Ollama with any GGUF Model on Hugging Face Hub
The model I pulled is bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF · Hugging Face
But I am not sure what the context length is for the pulled model. Is it the same as the default for models pulled from Ollama, which is 2048? Or is it the model's maximum context length (131072 for this model)?
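For reference, pulling a GGUF repo from the Hub uses the hf.co prefix described in that post; a minimal sketch (the quant tag here is just an example, pick whichever one you downloaded):
```shell
# Pull and run the GGUF model directly from Hugging Face (quant tag is an example)
ollama run hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M
```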
Thanks!
It seems that the default is 2048.
# FAQ
## How can I upgrade Ollama?
Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menubar item and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version [manually](https://ollama.com/download/).
On Linux, re-run the install script:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
## How can I view the logs?
Review the [Troubleshooting](./troubleshooting.md) docs for more about using logs.
## Is my GPU compatible with Ollama?
Please refer to the [GPU docs](./gpu.md).
By default, Ollama uses a context window size of 2048 tokens.
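You can override that per session or per request without rebuilding the model; a sketch, assuming the hf.co model name from above (any model name works):
```shell
# Interactively, inside `ollama run <model>`:
#   /set parameter num_ctx 8192

# Or per request via the REST API, using the `num_ctx` option:
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M",
  "prompt": "Hello",
  "options": { "num_ctx": 8192 }
}'
```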
Hi @John6666 ,
Thanks for the info! Does that mean the context length for models pulled by Ollama is always 2048, even if the model comes from HF?
If so, it looks like I have to change num_ctx manually after pulling the model.
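One way to make the change stick is a Modelfile; a minimal sketch, assuming the model name above (the new name and num_ctx value are just examples, sized to your VRAM):
```shell
# Modelfile that re-bases the pulled model with a larger context window
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M
PARAMETER num_ctx 32768
EOF

# Create a new local model with num_ctx baked in
ollama create mistral-small-32k -f Modelfile
```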