[Discussion] Beyond Text Merging: Exploring Composition of KV Caches from Parallel Agent Threads

Hello Hugging Face community,

I’m researching advanced architectures for AI agents and have a question about composing states from parallel execution paths.

The standard pattern for merging information from parallel agent threads (e.g., two lines of reasoning) is to take their text outputs, concatenate them into a new context, and re-process everything with a full forward pass.

I’m looking for methods that could operate directly on the model’s internal state. Specifically, the idea of composing or “stitching” the KV caches from two separately processed sequences.

Motivation: Directly merging KV caches would preserve the internal latent reasoning states from both threads, rather than forcing the model to regenerate them from scratch via reprocessing. This could avoid redundant computation and retain subtle intermediate representations that might otherwise be lost in a text-only merge. The benefit becomes especially critical for use cases with very long context windows (e.g., 1M tokens), where re-encoding the entire history would be prohibitively expensive.

Potential Applications:

  • Parallel Reasoning: Efficiently merging the state of an agent that has speculatively executed multiple reasoning paths in parallel.

  • Hierarchical Memory: Loading a compressed “memory block” (represented by a KV cache) into an active reasoning context without full re-computation.

Known Challenges: This approach is non-trivial within the standard transformer design. Naïve concatenation of KV caches introduces two major issues (the first is sketched concretely after this list):

  1. Positional encoding misalignment — each sequence is indexed from zero, so merging them produces inconsistent position references and a corrupted representation;
  2. Attention chain disruption — causal attention relies on a coherent progression of tokens, which breaks when independent reasoning threads are stitched together.
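
To make the first issue concrete, here is a toy sketch in plain PyTorch. The `rope_rotate` helper uses the interleaved RoPE formulation purely for illustration (real models may pair dimensions differently, e.g. the rotate-half variant), and none of the names correspond to an actual library API:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to vectors x of shape (seq_len, head_dim)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq_len, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                       # interleaved even/odd pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

head_dim, len_a, len_b = 64, 5, 4
k_a_raw = torch.randn(len_a, head_dim)   # thread A keys before RoPE
k_b_raw = torch.randn(len_b, head_dim)   # thread B keys before RoPE

# Each thread was encoded independently, so both caches start at position 0.
k_a = rope_rotate(k_a_raw, torch.arange(len_a))   # rotated for positions 0..4
k_b = rope_rotate(k_b_raw, torch.arange(len_b))   # rotated for positions 0..3 (overlaps with A)

# Issue 1: cache slots 5..8 of the naive merge still encode positions 0..3, so a new
# query sees two different tokens claiming the same relative offset.
merged_naive = torch.cat([k_a, k_b], dim=0)

# A consistent merge needs thread B's keys re-rotated as if they sat at positions 5..8.
k_b_relocated = rope_rotate(k_b_raw, torch.arange(len_a, len_a + len_b))
merged_consistent = torch.cat([k_a, k_b_relocated], dim=0)
```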

My Core Question: Given these constraints, what is the state-of-the-art in this area? Are there architectural modifications (e.g., different positional encoding schemes, attention variants) or specific techniques that make this kind of state composition more feasible?

I’m particularly interested in any papers that discuss approximations or alternative methods to achieve a similar outcome.

Thanks for any insights you can share!

It seems like quite a difficult challenge.

Thank you so much! Your summary is really insightful and comprehensive. It looks like “use a relocatable PE + RoPE shift” can work really well, given RoPE’s relative nature. As for the link token, I’m not sure we really need it if the attention over the following normal tokens is handled correctly; the effect seems close either way. What do you think?
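
To double-check my understanding of the shift part, here is a tiny sketch (the `rope_rotate` helper is the same toy interleaved formulation as above, purely for illustration, not any model’s real implementation): since RoPE is a pure rotation, relocating an already-cached key from position p to p + delta only needs one extra rotation by delta, with no access to the pre-RoPE key.

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate (seq_len, head_dim) vectors by RoPE angles for the given positions."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

head_dim, seq_len, delta = 64, 4, 7
raw_keys = torch.randn(seq_len, head_dim)

cached = rope_rotate(raw_keys, torch.arange(seq_len))                    # keys cached at positions 0..3
shifted = rope_rotate(cached, torch.full((seq_len,), delta))             # one extra rotation by delta
reference = rope_rotate(raw_keys, torch.arange(delta, delta + seq_len))  # keys rotated at positions 7..10

# Rotations at the same frequency compose additively, so shifting the cached keys
# matches re-encoding the raw keys at the new positions.
print(torch.allclose(shifted, reference, atol=1e-5))  # True
```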

I’m not at all knowledgeable about the theory, but anyway, it seems to be something like this?
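
(Pure-PyTorch sketch only; `merge_kv_caches` and `shift_keys` are names I made up for illustration, not the transformers API, and real cache objects such as `DynamicCache` lay out their per-layer tensors differently across library versions.)

```python
import torch
from typing import Callable, List, Tuple

# One (key, value) pair per layer, each tensor shaped (batch, n_heads, seq_len, head_dim).
LayerKV = Tuple[torch.Tensor, torch.Tensor]

def merge_kv_caches(
    cache_a: List[LayerKV],
    cache_b: List[LayerKV],
    shift_keys: Callable[[torch.Tensor, int], torch.Tensor],
) -> List[LayerKV]:
    """Concatenate two per-layer KV caches along the sequence axis.

    shift_keys(keys, offset) should re-rotate already-RoPE'd keys by `offset`
    positions; values carry no positional information, so they are copied as-is.
    """
    offset = cache_a[0][0].shape[2]  # thread A's length becomes thread B's start position
    merged = []
    for (k_a, v_a), (k_b, v_b) in zip(cache_a, cache_b):
        k_b = shift_keys(k_b, offset)                  # relocate thread B behind thread A
        merged.append((torch.cat([k_a, k_b], dim=2),   # keys:   (B, H, len_a + len_b, D)
                       torch.cat([v_a, v_b], dim=2)))  # values: (B, H, len_a + len_b, D)
    return merged

# Toy usage with random tensors and a no-op shift. A real shift would re-apply the
# model's rotary embedding for `offset` extra positions, as in the sketch above.
layers, heads, dim = 2, 4, 8

def random_cache(n_tokens: int) -> List[LayerKV]:
    return [(torch.randn(1, heads, n_tokens, dim), torch.randn(1, heads, n_tokens, dim))
            for _ in range(layers)]

merged = merge_kv_caches(random_cache(5), random_cache(3), shift_keys=lambda k, off: k)
print(merged[0][0].shape)  # torch.Size([1, 4, 8, 8])
```

I guess the continuation pass would then need position_ids starting at len_a + len_b and an attention mask covering the whole merged cache, and even then thread B’s cached states were computed without ever seeing thread A, which seems to be exactly the attention-chain issue from the original post.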
