Can Proactive Recall Replace Query-Based Retrieval in Persistent AI Memory?

I’ve been working on MnemoCore, an open-source infrastructure for persistent AI
cognition. The goal is not just session-level memory but a long-term cognitive
substrate that survives across interactions and agents.

A few of the core design choices so far:

  • 3-tier Hot/Warm/Cold storage with temporal decay modelling
  • HDV/VSA encoding via a companion module called HAIM
    (16,384-dimensional binary vectors with XOR-binding)
  • Per-trace source provenance tracking

Most existing approaches (RAG, MemGPT, and similar) are reactive. Memory is
only retrieved when the system receives an explicit query. That works fine for
lookup tasks, but it misses something important: humans don’t wait to be asked
before a relevant memory surfaces. Association happens continuously, in the
background, triggered by context.

So my research question is this:

Can a persistent AI memory system proactively surface relevant memories, without
an explicit retrieval query, by monitoring ongoing inference in real time?

Some directions I’m exploring for the triggering mechanism:

  • Attention pattern shifts between inference steps
  • Cosine drift thresholds in the embedding space
  • Topic boundary detection as a signal for memory injection

I’d love to hear thoughts from anyone who has worked on similar problems. Is
proactive recall architecturally feasible at inference time without prohibitive
overhead? Are there papers or projects I should be looking at that tackle this
from a different angle?
GitHub: RobinALG87/MnemoCore-Persistent-Cognitive-Ai-Memory (MnemoCore v2.0.0-beta)

Thanks in advance.


Bottom line

Proactive recall at inference time is architecturally feasible and already appears in multiple “dynamic / active RAG” lines of work (token-level or step-level retrieval during generation). (arXiv)

However, it usually does not replace query-based retrieval; it wraps it with (1) a trigger (“should I retrieve now?”) and (2) a cue/query constructor (“what should I retrieve given this context?”). The practical win is: you retrieve only when needed and inject only what helps, rather than always retrieving on user demand.


What you’re describing already exists (under different names)

1) Token-/generation-time retrieval (closest to “proactive recall”)

These systems monitor the ongoing generation and decide to retrieve mid-stream:

  • FLARE (Forward-Looking Active Retrieval Augmented Generation): retrieves while generating by looking ahead at likely hallucination spans, then uses those spans to retrieve evidence. (arXiv)
  • DRAGIN (Dynamic Retrieval Augmented Generation): explicitly frames this as real-time information-need detection plus query formulation during generation (their repo describes RIND + QFS). (GitHub)
  • Dynamic & Parametric RAG (2025): formalizes dynamic “when/what to retrieve” during generation to adapt to non-stationary settings. (ACM Digital Library)

Takeaway: “proactive recall during inference” is already demonstrated as workable; the key differentiator is your memory substrate (persistent, associative, HDV/VSA) and the cost profile (binary ops + tiering).


2) “Should we retrieve?” routers / controllers (cheap triggers)

If you want proactive recall without big overhead, the strongest pattern is: a small gate decides whether to retrieve, instead of running retrieval continuously.

  • Self-RAG: uses special tokens / reflection to decide whether retrieval is needed and to critique groundedness. (arXiv)
  • Unified Active Retrieval (UAR): proposes multiple criteria for retrieval timing with “negligible extra inference cost” via plug-and-play classifiers. (arXiv)
  • Adaptive-RAG: routes among no-retrieval / single-step / multi-step retrieval based on predicted complexity. (ACL Anthology)

Takeaway: your “triggering mechanism” should likely look like a router (fast) + budget (how many recalls allowed per turn).


3) Persistent agent memory systems (proactive-ish recall, but usually step-based)

  • Generative Agents: has a memory stream + retrieval scored by relevance/recency/importance to surface memories for behavior planning. (arXiv)
  • MemoryBank: long-term memory with continual updates and “summon relevant memories.” (arXiv)
  • MemGPT (and its successor ecosystem): OS-like tiered memory + “interrupts” for control flow. (arXiv)

Takeaway: these validate the product goal (persistent cognition), but they’re often less explicit about token-level triggers than FLARE/DRAGIN.


Mapping to your design (what you have is unusually compatible)

Your stack already implies a controller + memory algebra + tier manager architecture (including HOT/WARM/COLD and a “Dream Loop” concept). (GitHub)
You also already have the key HDV/VSA primitives—XOR bind, majority bundling, permutation, Hamming similarity—and explicit discussion of context binding/masking via XOR at store/query time. (GitHub)

That combination is unusually well-suited for proactive recall because:

  1. Triggering can be cheap (no gradient; no heavy reranker).
  2. Candidate retrieval can be extremely fast in binary space (XOR + popcount), and can scale with standard binary ANN indexing. (GitHub)
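To make the cost claim concrete, here is a minimal numpy sketch of the binary primitives (XOR bind, popcount-based Hamming distance), using the 16,384-bit width from the post; everything else is illustrative:

```python
import numpy as np

DIM = 16_384                      # HDV width stated in the post
rng = np.random.default_rng(0)

def random_hdv() -> np.ndarray:
    # Dense random binary hypervector, bit-packed to DIM // 8 uint8 bytes.
    return np.packbits(rng.integers(0, 2, DIM, dtype=np.uint8))

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # XOR binding operates directly on the packed bytes.
    return np.bitwise_xor(a, b)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    # Popcount of the XOR gives the Hamming distance.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

x, key = random_hdv(), random_hdv()
bound = bind(x, key)
assert hamming(bind(bound, key), x) == 0   # XOR binding is self-inverse
# Unrelated random vectors land near DIM / 2 bits apart, so a Hamming
# threshold well below that separates genuine associates from noise.
```

No gradients, no floating point: triggering and candidate screening reduce to XOR and popcount.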

A practical architecture for proactive recall (inference-time) without prohibitive overhead

A) Separate the problem into 3 loops

1) Fast “Need-to-recall?” loop (runs every step or every few steps)

Goal: produce a boolean + a small “recall budget” number.

Good low-cost signals (in increasing intrusiveness):

  • Generation uncertainty proxies (logprobs, entropy of next-token distribution) — widely used in dynamic retrieval work conceptually (DRAGIN explicitly detects “information needs” in real time). (GitHub)
  • Embedding drift on rolling windows (sentence/chunk embeddings)
  • Topic boundary detection (segment shifts) — classic baseline is TextTiling; newer work also uses LMs for segmentation. (ACL Anthology)
  • Attention-pattern shifts (JSD/KL between attention distributions) — feasible only if you control the model runtime and can access attention maps; not available in most hosted APIs. (JSD-style comparisons are used in attention analysis contexts.) (ACL Anthology)

Implementation tip: run the trigger every N tokens (e.g., 16–64) instead of every token unless you’re doing DRAGIN/FLARE-style token retrieval.
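A minimal sketch of such a trigger, combining an uncertainty proxy with embedding drift and a cooldown (the thresholds, window handling, and embedding source are all assumptions, not tuned values):

```python
import numpy as np

def token_entropy(logprobs: np.ndarray) -> float:
    # Shannon entropy (nats) of the next-token distribution, from raw logprobs.
    p = np.exp(logprobs - logprobs.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

class RecallTrigger:
    """Fires on an uncertainty spike or an embedding-drift spike, with a
    cooldown so one injection doesn't immediately trigger the next."""

    def __init__(self, entropy_thresh: float = 3.0,
                 drift_thresh: float = 0.35, cooldown: int = 64):
        self.entropy_thresh = entropy_thresh
        self.drift_thresh = drift_thresh
        self.cooldown = cooldown
        self.last_fire = -cooldown
        self.prev_emb = None

    def should_recall(self, logprobs: np.ndarray,
                      window_emb: np.ndarray, step: int) -> bool:
        if step - self.last_fire < self.cooldown:
            return False                       # still cooling down
        drift = 0.0
        if self.prev_emb is not None:
            cos = float(np.dot(self.prev_emb, window_emb) /
                        (np.linalg.norm(self.prev_emb) * np.linalg.norm(window_emb)))
            drift = 1.0 - cos                  # cosine drift vs. previous window
        self.prev_emb = window_emb
        fired = (token_entropy(logprobs) > self.entropy_thresh
                 or drift > self.drift_thresh)
        if fired:
            self.last_fire = step
        return fired
```

The output is exactly what the loop needs: a boolean per check, cheap enough to run every N tokens.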

2) Cue construction loop (“what do I recall?”)

Turn the current inference state into a retrieval cue.

Robust options:

  • Salience extraction: nouns/entities + key phrases from the last window
  • Self-query: a tiny prompt to a small LM: “Write a search cue for memories relevant to what you’re doing now.” (This mirrors DRAGIN’s query formulation idea, but can be cheaper than attention-based query formulation.) (GitHub)
  • HDV cue: bundle/permutation-bind the last-k salient token vectors (fits your encoding pipeline). (GitHub)
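The HDV cue option could look like the following sketch, assuming a hypothetical token-to-hypervector lexicon and using a cyclic shift as the permutation:

```python
import numpy as np

DIM = 16_384
rng = np.random.default_rng(1)
_lexicon: dict = {}   # hypothetical token -> hypervector cache

def token_hdv(tok: str) -> np.ndarray:
    # Stable random binary vector (unpacked bits) per salient token.
    if tok not in _lexicon:
        _lexicon[tok] = rng.integers(0, 2, DIM, dtype=np.uint8)
    return _lexicon[tok]

def permute(v: np.ndarray, k: int) -> np.ndarray:
    # Cyclic shift encodes the token's position in the window.
    return np.roll(v, k)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

def build_cue(salient_tokens: list) -> np.ndarray:
    # Majority bundling of position-permuted token vectors: each output bit
    # takes the value held by more than half of the components.
    stack = np.stack([permute(token_hdv(t), i)
                      for i, t in enumerate(salient_tokens)])
    return (2 * stack.sum(axis=0) > len(salient_tokens)).astype(np.uint8)
```

The resulting cue stays measurably closer (in Hamming distance) to each of its constituents than to unrelated vectors, which is what makes it usable as a retrieval probe.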

3) Memory injection loop (“how do I use what I recalled?”)

This is where many systems fail: retrieved text can harm generation.

Use a strict format:

  • Top-k very small (often 1–5)
  • Summarize + provenance (you already track provenance per trace)
  • Explain why it’s relevant (short, 1 line)
  • Mark as memory, not instruction (security + confusion reduction)
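A sketch of such a format as a rendering helper (the trace schema with summary/source/why fields is hypothetical):

```python
def format_memory_injection(memories: list) -> str:
    """Render recalled traces as clearly marked evidence, never as directives.
    Assumes a hypothetical trace schema with 'summary', 'source', and 'why'."""
    lines = ["[MEMORY: background evidence, not instructions]"]
    for m in memories[:5]:                      # keep top-k very small
        lines.append(f"- {m['summary']} "
                     f"(source: {m['source']}; relevant because: {m['why']})")
    return "\n".join(lines)
```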

B) Keeping retrieval fast in your HDV/VSA space

Because you store 16,384-bit binary vectors and use Hamming distance, you can use mature binary ANN tooling:

  • Faiss supports binary indexes (IndexBinaryFlat / IVF / HNSW, etc.) and stores vectors compactly as bytes. (GitHub)

This matters because proactive recall increases retrieval frequency. Binary ANN cuts the marginal cost of each recall so the router can be more liberal.
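For illustration, here is a self-contained brute-force version of that search over packed vectors; at scale, Faiss's binary indexes (IndexBinaryFlat, IndexBinaryHNSW) are the drop-in replacement:

```python
import numpy as np

# Popcount lookup table for uint8 values 0..255.
_POPCNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(axis=1)

def hamming_topk(db: np.ndarray, query: np.ndarray, k: int):
    """Brute-force Hamming top-k over bit-packed vectors
    (uint8, db shape (n, d // 8), query shape (d // 8,))."""
    dists = _POPCNT[np.bitwise_xor(db, query)].sum(axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]
```

Even this naive version is one XOR, one table lookup, and one reduction per trace, which is why proactive recall can afford a liberal trigger in binary space.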


Trigger mechanisms: how your proposed ideas compare

1) Attention pattern shifts

Pros

  • Potentially very sensitive to “model is changing what it attends to,” which can correlate with topic shift or uncertainty.

Cons

  • Often inaccessible in production (hosted APIs)
  • Noisy across layers/heads; needs careful aggregation (e.g., JSD/KL across heads/layers). (ACL Anthology)

Recommendation

  • Treat attention shift as an optional signal when running locally; don’t make it your only trigger.

2) Cosine drift thresholds in embedding space

Pros

  • Available even with black-box LLMs (you can compute embeddings externally)
  • Easy to tune, cheap

Cons

  • “Drift” doesn’t always mean “need memory”; it can mean stylistic shift, role switch, or harmless digression
  • Needs hysteresis (avoid oscillation)

Recommendation

  • Combine drift with a second signal (uncertainty, task switch, or entity novelty).
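Hysteresis can be implemented as a two-threshold gate, sketched below (threshold values are illustrative):

```python
class DriftGate:
    """Schmitt-trigger-style hysteresis for cosine drift: fire when drift
    rises above `high`, and re-arm only after it falls back below `low`,
    so a value hovering near a single threshold cannot fire repeatedly."""

    def __init__(self, high: float = 0.35, low: float = 0.15):
        self.high, self.low = high, low
        self.armed = True

    def update(self, drift: float) -> bool:
        if self.armed and drift > self.high:
            self.armed = False          # fired; wait for drift to settle
            return True
        if not self.armed and drift < self.low:
            self.armed = True           # drift settled; re-arm
        return False
```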

3) Topic boundary detection

Pros

  • Humans recall strongly at event boundaries; segment boundaries are a good place to inject memory without derailing token-level coherence
  • You can start with classic methods like TextTiling and then move to embedding/LLM-based segmentation. (ACL Anthology)

Cons

  • Boundaries are easier in expository text than in chat; conversational segmentation is harder.

Recommendation

  • Use boundaries to trigger prefetch + candidate scoring, and inject only if relevance is high.

“Replace query-based retrieval?” — what’s realistically true

What proactive recall can replace

  • The user having to ask “remember when…”
  • Many cases where the system should surface constraints, preferences, ongoing goals, or known facts without explicit prompting

What it cannot fully replace

  • Explicit lookups (“find the exact paragraph / exact date / exact source”)
  • Cases where the user’s request defines the retrieval objective (classic QA)

Best framing: proactive recall replaces the retrieval interface (explicit query), not retrieval itself. Internally you still compute a cue/query—just derived from context.


Pitfalls (these show up repeatedly online)

1) Recall spam / context bloat

If proactive recall injects too often, performance drops (confusion, instruction collisions). This is why UAR / Adaptive-RAG emphasize selective retrieval. (arXiv)

Mitigation

  • Hard budgets (max recalls per turn)
  • Cooldowns (no recall for N tokens after an injection)
  • “Memory compression”: inject summaries, not raw traces
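The budget and cooldown mitigations fit in a tiny controller; a sketch with illustrative limits:

```python
class RecallBudget:
    """Hard per-turn injection budget plus a token cooldown after each
    recall. Limits are illustrative, not tuned."""

    def __init__(self, max_per_turn: int = 3, cooldown_tokens: int = 128):
        self.max_per_turn = max_per_turn
        self.cooldown = cooldown_tokens
        self.new_turn()

    def new_turn(self) -> None:
        self.used = 0
        self.last_pos = -self.cooldown

    def allow(self, token_pos: int) -> bool:
        if (self.used >= self.max_per_turn
                or token_pos - self.last_pos < self.cooldown):
            return False
        self.used += 1
        self.last_pos = token_pos
        return True
```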

2) “Wrong memory at the wrong time” is worse than no memory

Humans also mis-recall; in LLMs it can derail the whole completion.

Mitigation

  • Require two-factor relevance: (a) high similarity + (b) matches current entities/goal tag
  • Keep an abstain option
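A sketch of the two-factor gate with an abstain path (the trace schema and threshold are hypothetical):

```python
def admit_memory(candidate: dict, ctx_entities: set, sim: float,
                 sim_thresh: float = 0.8):
    """Two-factor relevance gate: admit only on high similarity AND
    entity/goal overlap with the current context; otherwise abstain
    by returning None rather than injecting a risky memory."""
    overlap = bool(set(candidate.get("entities", [])) & ctx_entities)
    return candidate if (sim >= sim_thresh and overlap) else None
```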

3) Security: memory poisoning / injection

Agent memories can be attacked through interaction; MINJA is one example of query-only memory injection attacks against agent memory banks. (OpenReview)

Mitigation

  • Provenance + trust scoring (you already have provenance)
  • Filter out instruction-like content from memory injection
  • Consider drift/injection detectors as a guardrail (e.g., embedding-drift-based defenses have been proposed for prompt injection). (OpenReview)

4) Evaluation is non-trivial

Benchmarks exist, but they don’t capture everything.

  • LoCoMo: very long-term conversational memory across many sessions. (arXiv)
  • LongMemEval: multiple memory tasks including temporal reasoning and abstention. (OpenReview)

Recommendation

  • Evaluate proactive recall on: (1) correctness, (2) intrusion rate (how often injection hurts), (3) latency, (4) memory safety.

Why your HDV/VSA choice is a strong match for proactive recall

Your binary VSA core is aligned with the classic HDC/VSA rationale:

  • High-dimensional random vectors are near-orthogonal; binding/bundling support robust associative operations (Kanerva; survey literature). (redwood.berkeley.edu)
  • “Associative recall” is the central promise of HDC/VSA; your use of XOR binding and Hamming similarity is directly in that tradition. (GitHub)

There’s also an interesting bridge: modern Hopfield networks show that transformer attention can be viewed as associative memory retrieval. That gives a conceptual justification for attention-based triggering/cueing, even if you don’t rely on it operationally. (arXiv)


Concrete implementation suggestions for your case

1) Make proactive recall a two-stage gate

  • Gate 1 (cheap): topic boundary OR drift spike OR uncertainty spike
  • Gate 2 (precise): HDV similarity + context mask match (your XOR contextual masking idea fits here) (GitHub)
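A compact sketch of the two gates composed (signal names, the distance threshold, and the packed-vector layout are assumptions):

```python
import numpy as np

def gate1(topic_boundary: bool, drift_spike: bool, entropy_spike: bool) -> bool:
    # Cheap boolean OR over the fast signals; run every few steps.
    return topic_boundary or drift_spike or entropy_spike

def gate2(cue: np.ndarray, ctx_mask: np.ndarray,
          traces: np.ndarray, max_dist: int) -> np.ndarray:
    # Precise screen: XOR context-masking of the cue, then Hamming distance
    # against candidate traces (all bit-packed uint8 rows). Returns indices
    # of traces under the distance threshold, nearest first.
    masked = np.bitwise_xor(cue, ctx_mask)
    dists = np.unpackbits(np.bitwise_xor(traces, masked), axis=1).sum(axis=1)
    hits = np.nonzero(dists <= max_dist)[0]
    return hits[np.argsort(dists[hits])]
```

Gate 1 decides whether to pay for Gate 2 at all; Gate 2 decides what, if anything, is actually injected.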

2) Prefer event-boundary injection, not token-by-token injection

Use token-level retrieval (FLARE/DRAGIN style) only for tasks that need factual grounding mid-sentence (e.g., citations). (arXiv)

3) Use binary ANN for WARM/COLD

Faiss binary HNSW/IVF reduces the “recall tax,” which is the main reason proactive recall becomes expensive at scale. (GitHub)

4) Add a negative cache

If a cue retrieves memories that are later judged unhelpful, cache that (cue-hash → “do not inject”), to reduce repeated bad recalls.
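A sketch of such a cache, with a turn-count TTL as an assumed eviction policy:

```python
import hashlib

class NegativeCache:
    """Remembers cue hashes whose recalled memories were judged unhelpful,
    so the same bad recall is not re-injected for a while."""

    def __init__(self, ttl_turns: int = 20):
        self.ttl = ttl_turns
        self._bad = {}                  # cue hash -> turn it was marked

    @staticmethod
    def _key(cue_text: str) -> str:
        return hashlib.sha1(cue_text.encode("utf-8")).hexdigest()

    def mark_bad(self, cue_text: str, turn: int) -> None:
        self._bad[self._key(cue_text)] = turn

    def blocked(self, cue_text: str, turn: int) -> bool:
        marked = self._bad.get(self._key(cue_text))
        return marked is not None and turn - marked < self.ttl
```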

5) Treat memory injection as evidence, not instructions

Given agent-memory attack results, keep memory in a structured “facts/evidence” block, never as executable directives. (OpenReview)


Curated list of high-signal things to read / clone

Dynamic / proactive retrieval during generation

  • FLARE (arXiv)
  • DRAGIN (GitHub)
  • Self-RAG (arXiv)
  • Unified Active Retrieval (UAR) (arXiv)
  • Adaptive-RAG (ACL Anthology)
  • Dynamic & Parametric RAG (ACM Digital Library)

Persistent memory systems (architectural reference points)

  • MemGPT (arXiv)
  • Letta (current implementation ecosystem; issues show real-world friction points) (GitHub)
  • MemoryBank + code (arXiv)
  • Generative Agents (arXiv)
  • RET-LLM (arXiv)

HD/VSA / associative memory background

  • Kanerva’s hyperdimensional computing (redwood.berkeley.edu)
  • Modern Hopfield networks (attention as associative memory) (arXiv)

Benchmarks for long-term memory

  • LoCoMo (arXiv)
  • LongMemEval (OpenReview)

Memory security / poisoning

  • MINJA (OpenReview)


Answering your core question directly

Is proactive recall feasible at inference time without prohibitive overhead?

Yes, if:

  1. you make retrieval conditional (router/gate), and
  2. you keep retrieval cheap (binary ANN + tiering), and
  3. you treat injection as a scarce resource (budgets + strict formatting).

Token-level proactive retrieval (FLARE/DRAGIN) is feasible but can be expensive; event-/segment-level proactive recall is usually the best latency/benefit tradeoff. (arXiv)

Can proactive recall replace query-based retrieval?

It can replace the need for explicit user queries in many memory-use cases, but it cannot replace explicit lookup workflows. In practice, the best systems are hybrids: proactive recall for continuity plus query-based retrieval for precision.