I learnt a little about how LLMs work and know that they generate one token at a time, feeding the growing conversation back through the model on each step. Is it somehow possible to detect a sequence in the generation and dynamically append context?
Some background information
I want to build agentic chatbots, cheaply. Here’s the problem:
Let’s say input is $3/Mtok and we have a 10K-token conversation: the input cost is 3 cents per request.
I want the chatbot to retrieve the necessary information and perform actions, but the usual loop is not very efficient: every tool call resends the whole conversation as input, so 5 or 10 tool calls may be OK, but hundreds will cost a lot over time, not counting reasoning and output tokens. Since I know that LLMs just loop while generating content, I want to try open-source models and, when a tool call is detected, just append the result to the beginning of the message and continue. A rough cost sketch is below.
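For concreteness, here is a back-of-the-envelope script showing how resending the growing conversation on every tool call adds up; the 500-tokens-per-call figure is just an assumption I picked for illustration:

```python
# Back-of-the-envelope cost of an agent loop, assuming each tool call
# resends the full, growing conversation as input. Numbers are illustrative.
PRICE_PER_MTOK = 3.00    # $ per million input tokens
BASE_CONTEXT = 10_000    # tokens in the initial conversation
PER_CALL = 500           # tokens each tool call + result adds (my assumption)

for n_calls in (10, 100):
    total_in = sum(BASE_CONTEXT + i * PER_CALL for i in range(n_calls))
    print(f"{n_calls:>3} calls: {total_in:,} input tokens, "
          f"~${total_in / 1e6 * PRICE_PER_MTOK:.2f}")
# ~$0.37 for 10 calls vs ~$10.4 for 100 calls
```

The re-prefilling of the same prefix on every call is where most of that money goes, which is why I want to keep one generation running instead.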
I know I can stop the generation and restart it with the new context, but is there a more efficient way? Maybe this is related to why LLMs have a much longer time-to-first-token than per-token latency: restarting generation would mean paying that prefill delay again each time.
To build an efficient, cost-effective agentic chatbot that can modify its context during generation, consider the following approaches, drawing on the cited sources:
Dynamic Context Augmentation with RAG: Integrate Retrieval-Augmented Generation (RAG) to retrieve relevant information and append it to the context when needed. This augments the model’s knowledge in real time and avoids some expensive tool-call round trips [1].
Efficient Context Pruning with LazyLLM: Use LazyLLM-style dynamic token pruning during prefilling and decoding, dropping tokens that are unimportant for predicting the next token. This shrinks the effective context length and the compute spent per step [3].
Resource Decoupling with Infinite-LLM: Borrow Infinite-LLM’s approach of decoupling the attention layers and their KV cache from the rest of the model, enabling flexible and efficient resource scheduling. This supports dynamic context changes without restarting the generation process [2].
Tool Call Detection and Context Update: Monitor the token stream for a marker indicating a tool call. When one appears, run the tool and append its result to the end of the current context (appending at the beginning would invalidate the position-dependent KVCache), then resume decoding with the existing KVCache so the original prompt is never re-prefilled [2][3]. A minimal sketch of this loop follows the list.
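Here is a minimal sketch of that loop with a local Hugging Face transformers model. The tag names, the model choice, and run_tool() are illustrative assumptions, not a fixed API; the key point is that the tool result is appended at the end so the cached prefix stays valid:

```python
# Minimal sketch of mid-generation tool injection with a local model,
# assuming a Hugging Face transformers causal LM. The tag names, the
# model choice, and run_tool() are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any local causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

OPEN, CLOSE = "<tool_call>", "</tool_call>"

def run_tool(call_text: str) -> str:
    # Hypothetical dispatcher: parse call_text and return the tool's result.
    return '{"weather": "sunny", "temp_c": 21}'

prompt = ("You can call tools by emitting <tool_call>...</tool_call>.\n"
          "User: What's the weather in Paris?\nAssistant:")
input_ids = tok(prompt, return_tensors="pt").input_ids
past = None               # KV cache: the prompt is prefilled once, then reused
generated = []

with torch.no_grad():
    for _ in range(512):
        out = model(input_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy, for brevity
        generated.append(next_id.item())
        if next_id.item() == tok.eos_token_id:
            break
        text = tok.decode(generated)
        if OPEN in text and text.endswith(CLOSE):
            # Tool call detected mid-generation: run it and APPEND the result
            # at the END of the context. Prepending would shift every cached
            # position and force a full re-prefill.
            call = text[text.rindex(OPEN) + len(OPEN):-len(CLOSE)]
            injected = tok("\n<tool_result>" + run_tool(call) + "</tool_result>\n",
                           return_tensors="pt").input_ids
            # Feed the last generated token plus the injected tokens; only
            # these new tokens pay prefill cost, everything else is cached.
            input_ids = torch.cat([next_id, injected], dim=-1)
        else:
            input_ids = next_id  # normal decode step: feed just the new token

print(tok.decode(generated))
```

Because only the injected tokens are prefilled, each tool call costs time proportional to the tool result length rather than the whole conversation; hosted APIs get a similar, though weaker, effect from prompt caching.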
Combined, these techniques (fewer tool-call round trips, a pruned context, and a reused KV cache) let the chatbot modify its context mid-generation while keeping cost and latency down, which is what makes the approach scale.
I already know about RAG; I’m talking more about efficiency. With RAG I’d have to make 2 requests, but I want to do it with one call, effectively using fewer requests.
I do not think what you want to achieve is possible without the model being able to explicitly do routing or gating based on the input. If you can modify the model structure, you could achieve this with a gating mechanism: that would be the contextual change you are seeking, where one input is split into many different inputs internally. You would need some sort of marker to tell the gate where one input ends and another starts, but that is easily done with a tag. You could also do this in straight Python by preprocessing the inputs before passing them into the model (a sketch is below), but either way it would need to be built in.
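If it helps, here is a minimal sketch of the preprocessing variant; the <seg> marker and retrieve_context() are placeholders I made up, not anything built into a library:

```python
# Minimal sketch of the "straight Python" preprocessing idea: split one
# request on a marker tag and route each segment before the model sees it.
# The tag "<seg>" and retrieve_context() are hypothetical placeholders.
from typing import Callable

MARKER = "<seg>"

def retrieve_context(segment: str) -> str:
    # Stand-in for real routing: vector search, a SQL lookup, an API call, etc.
    return f"[retrieved facts for: {segment!r}]"

def preprocess(raw_input: str, route: Callable[[str], str] = retrieve_context) -> str:
    # Split the single input into internal sub-inputs, enrich each one,
    # and reassemble a single prompt; the model never does the routing itself.
    segments = [s.strip() for s in raw_input.split(MARKER) if s.strip()]
    return "\n\n".join(f"{seg}\n{route(seg)}" for seg in segments)

print(preprocess("What is the refund policy?<seg>Cancel order #123"))
```

This keeps everything in one model call, but the routing happens before generation, so it cannot react to what the model decides mid-stream the way a gate inside the model could.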