Question About the Practicality of the Context Length

Hello everyone, this is my first post here :slight_smile:

I looked for an existing topic that would answer my question, but since I couldn't find one, I decided to start this one.

I have a question about context length, specifically about how practical it is.

Now, you can call me a malcontent, but isn't a context length of 4096 tokens, which is what you can set for LLaMA 2 or Falcon (not to mention 2048 tokens), far too little for a longer conversation?

What I mean is this: I am using the model maddes8cht/ehartford-WizardLM-Uncensored-Falcon-40b-gguf · Hugging Face by Mathias Bachmann, known as @maddes8cht, in LM Studio for macOS. If I pass shorter and longer portions of text over the course of a conversation, I quickly reach, and even exceed, the context length limit of either 2048 or 4096 tokens (2048 seems to be the standard value, whereas 4096 is something I simply tried myself). This has happened to me more than once.

Is my observation correct that, although you can continue the conversation after exceeding the context length limit, the chat no longer remembers what was discussed before the limit was exceeded? It also looks as though, once the limit has been reached and then exceeded, whatever is discussed afterwards is counted as if the conversation had started again from zero tokens. Working this way, the chat does not seem to remember the earlier discussion and cannot refer back to it on its own. If, for example, you ask it about something that was mentioned earlier, it says, one way or another, that it has not seen any such thing.
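
If my understanding is right, what happens is roughly the following. This is only a sketch of how I imagine a chat front end handles an overflowing context; I don't know what LM Studio actually does internally, and the `count_tokens` helper and the message format are placeholders of my own:

```python
# Rough sketch of the kind of trimming I suspect a chat front end has to do
# once the conversation outgrows the context window. Not LM Studio's actual
# code; count_tokens() and the message format are made up for illustration.

def count_tokens(text: str) -> int:
    # Crude stand-in; a real client would use the model's own tokenizer.
    return max(1, len(text) // 4)

def fit_to_context(messages: list[dict], n_ctx: int = 2048,
                   reserve_for_reply: int = 512) -> list[dict]:
    """Drop the oldest messages until what remains fits the context window."""
    budget = n_ctx - reserve_for_reply
    kept, used = [], 0
    for msg in reversed(messages):          # walk backwards: newest turns survive
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                           # everything older is silently forgotten
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```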

I have been trying to contact @maddes8cht to ask what the exact maximum context length for his model is (I was unable to find specific information on this for the Falcon architecture in particular), but neither Hugging Face, GitHub nor Twitter (now X) supports sending private messages. Perhaps someone here knows the maximum context length for the Falcon architecture? I know that both 2048 and 4096 work fine, but when, for testing purposes, I doubled the latter to 8192, the model began generating gibberish. Since 8192 already seems to be too much, trying even larger values is pointless.
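
For what it's worth, this is how I tried to check the trained context length of the base model myself, using the transformers library. I am assuming the upstream repository tiiuae/falcon-40b exposes this value in its config; if it doesn't, the code just falls back to a placeholder string:

```python
# Checking what context length the base Falcon model was configured for.
# Assumption: the upstream repo (tiiuae/falcon-40b) exposes this field in its
# config; if not, getattr() falls back to the placeholder string below.

from transformers import AutoConfig

# Older transformers versions may need trust_remote_code=True for this repo.
config = AutoConfig.from_pretrained("tiiuae/falcon-40b")
print(getattr(config, "max_position_embeddings", "not specified in config"))
# If the printed value is 2048, that would explain why 8192 produces gibberish:
# those positions fall outside anything the model saw during training.
```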

Am I right that the chat does not remember what was discussed earlier once the context length limit has been exceeded, and don't such limitations make AI chats more toys for fun than practical tools for work?

Two things can happen: the parameters you set when loading the model can limit the context, and the model itself has a limit.

In particular, when you exceed the limit set in the parameters, your query gets cut off in the middle, so it may be missing a closing quote, for example… this screws things up, and who knows what data the model actually gets in that case.
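
Roughly like this, using llama.cpp's Python bindings only as a stand-in for whatever LM Studio does internally (the model path is just a placeholder):

```python
# The two limits at play: n_ctx, which you choose when loading the model, and
# the model's own trained limit. Sketch only; llama-cpp-python stands in for
# whatever LM Studio does internally, and the model path is a placeholder.

from llama_cpp import Llama

llm = Llama(
    model_path="wizardlm-uncensored-falcon-40b.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,  # the window requested at load time
)

# If the prompt plus conversation history is longer than n_ctx, something has
# to be cut. A cut in the middle of a sentence or a quoted block leaves the
# model with mangled input, which is where the garbage can come from.
out = llm("Summarize our discussion so far.", max_tokens=256)
print(out["choices"][0]["text"])
```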

Falcon was originally trained for 2048…

Consider also that the "memory" of the previous conversation fills up the context too… meaning the longer the conversation, the harder the model has to work, and this alone can overfill your context. As a rough back-of-the-envelope example, see the sketch below.
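
All numbers here are made up, purely to show how the budget shrinks:

```python
# Back-of-the-envelope budget for a 2048-token window; the numbers are
# invented, only to show how history eats into the room left for new input
# and for the model's reply.

n_ctx = 2048
system_prompt = 150   # instructions / persona sent with every request
history = 1400        # all earlier user turns and model replies
new_prompt = 300      # the text you just pasted in

room_for_reply = n_ctx - (system_prompt + history + new_prompt)
print(room_for_reply)  # 198 -> either the reply gets cut short, or old history
                       # has to be dropped, which is the "forgetting" you see
```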

Thank you for your reply.

Yes, I have noticed that every token in the conversation, including the chat's own responses, counts against the total number of tokens available. Considering that its responses can sometimes be fairly long, they can use up a significant share of the tokens at the user's disposal.

I am curious exactly which parameters you mean when you say that the parameters set when loading the model can limit the context.

As far as I have seen while working with models in LM Studio for macOS (I would guess the parameters you can set for a model are much the same in other tools), one parameter you may have in mind is "Tokens to generate", as LM Studio calls it, defined there as "the number of tokens the model generates in response to the input prompt". I have noticed that setting this parameter to a specific positive integer, rather than the default of -1 that lets the model stop generating on its own, does not necessarily mean the model will generate exactly that many tokens in its response. Setting it to 500, for example, I found that the model would generate roughly 500 tokens, but still fewer than 500 exactly; I have sketched below how I picture this working.
If my guess is correct, what other parameters can limit the context length? I assume it is not the parameter controlling the randomness or determinism of the responses (although I may be wrong), but which other parameters can affect it?
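
Here is the sketch I mentioned above, using llama-cpp-python purely as an illustration of the "Tokens to generate" behaviour; I do not know what LM Studio runs underneath, and the model path is a placeholder:

```python
# Why setting "Tokens to generate" to 500 doesn't yield exactly 500 tokens:
# the value is a ceiling, and generation stops earlier whenever the model
# emits its end-of-sequence token or hits a stop condition. Illustration only;
# the model path is a placeholder.

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048)  # placeholder path
out = llm("Explain what a context window is.", max_tokens=500)

choice = out["choices"][0]
print(choice["finish_reason"])  # "stop"   -> the model ended on its own before 500
                                # "length" -> it actually hit the 500-token cap
```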

I can also add that I have changed the context length from 4096 back to 2048 tokens, and although the model may now remember less of the conversation, its responses seem more to the point and concrete. Of course, this means passing it shorter input prompts at a time, but if the responses are more concrete as a result, that is certainly acceptable.