A note up front: for mobile users especially, IP addresses change constantly, so do not use the client IP as a session key (see Step 5).
Why your case is tricky (and why the usual “intent classifier first” fails)
You have a closed-domain KB (a finite set of grammar tense topics) and you want strictly KB-grounded answers. In this setting, “intent classification” is often the wrong gate, because many valid user turns are underspecified:
- short topic-only inputs: `simple past tense`
- elliptical follow-ups: `give example`
- comparison phrasing: `difference between past and present tense`
- typos in a key anchor word: `tesne`
Conversational retrieval research describes this as query underspecification (ellipsis / anaphora / topic return). A standard remedy is query resolution (carry forward missing context) rather than rejecting the turn as out-of-domain. (arXiv)
Practical frameworks expose this explicitly as a “condense / rewrite then retrieve” step (e.g., “Condense Question Mode”), but issues show that rewriting can introduce latency and failure modes if applied blindly. (LlamaIndex)
For your constraints (small KB, deterministic answers), the best practice is:
1. Resolve topics first (entity-linking to your KB).
2. Decide "in-domain / out-of-domain" only after you try to link the query to your known topics.
Target design: “operation + topics” (deterministic orchestration)
Treat each user turn as:
- operation: `DEFINE | EXAMPLES | COMPARE | MULTI`
- topics: one or more canonical topic IDs from your KB
Then:
- Retrieval is trivial: `topic_id → JSON entry`
- Multi-topic answers are guaranteed by construction
- Follow-ups are resolved by state (`last_topics`, `last_operation`)
This is essentially a closed-domain version of conversational query resolution. (arXiv)
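Concretely, a parsed turn can be represented as a tiny structure (a sketch; the class and field names are illustrative):

```python
from dataclasses import dataclass, field

# Illustrative parse result: one operation plus canonical topic IDs from the KB.
@dataclass
class ParsedTurn:
    operation: str                               # "DEFINE" | "EXAMPLES" | "COMPARE" | "MULTI"
    topics: list = field(default_factory=list)   # e.g. ["simple_past", "simple_present"]

turn = ParsedTurn(operation="COMPARE", topics=["simple_past", "simple_present"])
```

Everything downstream (retrieval, rendering, state updates) consumes only this object, which keeps the orchestration deterministic.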
Step 0: Structure your JSON so “difference” can be deterministic
A comparison answer is easiest if each tense entry has fields you can place side-by-side (no LLM generation needed).
Example schema (illustrative):
```json
{
  "id": "simple_past",
  "title": "Simple Past Tense",
  "aliases": ["past simple", "simple past", "simple past tense"],
  "definition": "...",
  "form": "V2 (regular: -ed; irregular: varies)",
  "time_reference": "past",
  "common_uses": ["completed action in the past", "series of completed actions"],
  "signal_words": ["yesterday", "last week", "ago"],
  "examples": ["I visited Kyoto yesterday.", "She cooked dinner."]
}
```
If you already have a simpler KB, you can still add these fields incrementally and fall back to “definition + examples” when a field is missing.
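With side-by-side fields in place, a COMPARE answer is pure templating. A minimal sketch, assuming entries shaped like the schema above (sample data abridged):

```python
# Render a comparison from two KB entries - no text generation involved.
def render_compare(a: dict, b: dict) -> str:
    lines = [f"{a['title']} vs {b['title']}"]
    for key in ("time_reference", "form"):
        lines.append(f"- {key}: {a.get(key, 'n/a')} | {b.get(key, 'n/a')}")
    lines.append(f"- example: {a['examples'][0]} | {b['examples'][0]}")
    return "\n".join(lines)

past = {"title": "Simple Past Tense", "time_reference": "past",
        "form": "V2 (regular: -ed)", "examples": ["I visited Kyoto yesterday."]}
present = {"title": "Simple Present Tense", "time_reference": "present",
           "form": "V1 (+s for 3rd person)", "examples": ["I visit Kyoto often."]}
```

Missing fields simply render as `n/a`, which matches the "fall back when a field is missing" advice.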
Step 1: Handle small spelling mistakes without heavy NLP
Recommended lightweight combo
- SymSpell for fast token-level corrections of domain keywords. SymSpell is designed for very fast spelling correction via the "symmetric delete" approach. (GitHub)
- RapidFuzz for phrase/topic matching against your controlled vocabulary. RapidFuzz provides fast fuzzy string matching and is a common choice for matching user phrases to canonical topic names/aliases. (RapidFuzz)
Practical method (minimal, safe)
1. Build a SymSpell dictionary from:
   - all topic titles + aliases
   - core grammar terms: `tense, past, present, future, simple, perfect, continuous, progressive`
2. Only correct tokens that:
   - are alphabetic
   - are 3–12 characters long
   - are not already in your domain dictionary
3. Apply corrections before regex/topic parsing.
This fixes your specific class of errors (`tesne` → `tense`) without pulling in large NLP stacks.
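In production you would use symspellpy, but the symmetric-delete idea is small enough to sketch with the standard library (the dictionary below is a stand-in for your titles + aliases + core terms, and the tie-break is simplified):

```python
# A miniature of SymSpell's symmetric-delete lookup (edit distance 1).
DOMAIN = {"tense", "past", "present", "future", "simple",
          "perfect", "continuous", "progressive"}

def _deletes(word: str) -> set:
    return {word[:i] + word[i + 1:] for i in range(len(word))}

# Precompute the delete-variants of every dictionary word once.
INDEX = {}
for term in DOMAIN:
    for key in _deletes(term) | {term}:
        INDEX.setdefault(key, set()).add(term)

def correct(token: str) -> str:
    # Guards from the text: alphabetic, length 3-12, not already in-domain.
    if not token.isalpha() or not 3 <= len(token) <= 12 or token in DOMAIN:
        return token
    candidates = set()
    for key in _deletes(token) | {token}:
        candidates |= INDEX.get(key, set())
    # Deterministic tie-break; real SymSpell ranks by word frequency instead.
    return sorted(candidates)[0] if candidates else token
```

`correct("tesne")` recovers `"tense"` because a delete-variant of the typo collides with a delete-variant of the dictionary word; no edit-distance scan over the whole vocabulary is needed.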
Why fuzzy matching still matters
Even with SymSpell, users can type partial phrases or mix word order. Fuzzy matching to your alias list is extremely effective in a closed domain.
Step 2: Handle short topic-only inputs (stop rejecting them)
Best practice: retrieval-first domain routing
Instead of:
`intent_classify → (maybe) retrieve`
do:
`topic_link → score → decide in-domain → answer`
Short queries lack the verbs and sentence structure that many intent classifiers rely on, so a classify-first gate rejects them.
A reliable topic-linking stack (ordered)
1. Exact / alias match (fast path)
2. Greedy longest-match phrase resolution: prefer `simple past tense` over `past tense` if both match
3. Fuzzy match via RapidFuzz against title + aliases (RapidFuzz)
4. Embedding similarity (your current LLM semantic matching)
5. TF-IDF fallback (your existing method)
Scoring and thresholds (practical)
Keep it simple and stable: accept the best-scoring topic only when it clears a single fixed threshold; otherwise treat the turn as out-of-domain.
If accepted, default operation = DEFINE unless the query contains example cues (`example`, `sample sentence`, etc.).
This single change usually eliminates the "topic-only input gets classified as out-of-topic" failure.
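Steps 1–3 of the stack can be sketched as follows. `difflib` stands in for RapidFuzz so the example is dependency-free; `rapidfuzz.process.extractOne` is the faster drop-in for production. The alias table is illustrative:

```python
import difflib

# alias → canonical topic ID (illustrative KB excerpt)
ALIASES = {
    "simple past tense": "simple_past", "simple past": "simple_past",
    "past tense": "past", "present tense": "present",
    "present perfect": "present_perfect",
}

def link_topics(query: str, fuzzy_cutoff: float = 0.8) -> list:
    query = query.lower()
    found = []
    # 1-2) Greedy longest-match: try longer aliases first, then consume
    # the matched span so "past tense" cannot re-match inside it.
    for alias in sorted(ALIASES, key=len, reverse=True):
        if alias in query:
            found.append(ALIASES[alias])
            query = query.replace(alias, " ")
    if found:
        return found
    # 3) Fuzzy fallback against the whole alias list (RapidFuzz in production).
    match = difflib.get_close_matches(query.strip(), list(ALIASES),
                                      n=1, cutoff=fuzzy_cutoff)
    return [ALIASES[match[0]]] if match else []
```

An empty result here, not a classifier verdict, is what should trigger the out-of-domain path.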
Step 3: Enforce multi-topic answers for “and” / “difference” queries
Background: multi-topic collapse is a known retrieval failure mode
Single-pass retrieval often returns evidence for only one facet unless you decompose/fan out. Query decomposition is a common RAG optimization for complex or multi-part questions. (Haystack)
In your closed-domain KB, you don’t need LLM decomposition
You can parse operators deterministically.
Operator detection (simple rules)
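A minimal rule set, ordered so COMPARE wins over MULTI (since "difference between X and Y" also contains "and"); the cue words are illustrative, not exhaustive:

```python
import re

# Ordered rules: first match wins.
def detect_operation(query: str) -> str:
    q = query.lower()
    if re.search(r"\b(difference|compare|versus|vs)\b", q):
        return "COMPARE"
    if re.search(r"\b(examples?|sample sentences?)\b", q):
        return "EXAMPLES"
    if re.search(r"\band\b", q) or "," in q:
        return "MULTI"
    return "DEFINE"
```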
Topic extraction strategy per operator
- COMPARE: parse `difference between X and Y`, resolve the topic for X and the topic for Y independently, and guarantee exactly 2 outputs.
- MULTI: split on `and` / commas into segments, resolve one topic per segment, and return N outputs in the same order as mentioned.
If you still only find one topic (common with typos/short segments)
Run a “rescue pass”:
- apply SymSpell corrections again
- alias expansion: `past` → `past tense`
- embedding-based candidate recovery on each segment
- if a COMPARE query still yields only 1 topic: return that answer + the closest second match (by score) as a "best guess"
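The COMPARE and MULTI segmentations are plain string surgery; a sketch with illustrative patterns:

```python
import re

def split_compare(query: str) -> list:
    # "difference between X and Y" → ["X", "Y"]
    m = re.search(r"(?:difference|compare)\s+(?:between\s+)?(.+?)\s+and\s+(.+)",
                  query.lower())
    return [m.group(1).strip(), m.group(2).strip()] if m else []

def split_multi(query: str) -> list:
    # "X and Y, Z" → ["X", "Y", "Z"], preserving mention order
    parts = re.split(r"\s*(?:,|\band\b)\s*", query.lower())
    return [p.strip() for p in parts if p.strip()]
```

Each returned segment then goes through the topic-linking stack (and the rescue pass) independently.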
(Optional) Improve recall with multi-query + fusion
If your semantic matching sometimes misses one side of a comparison, generate a few query variants (rephrases) and fuse results using Reciprocal Rank Fusion (RRF). RRF is a standard method for combining ranked lists from multiple retrieval runs. (G. V. Cormack)
You can do this without answer generation—only for retrieval robustness.
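RRF gives each document the score Σ 1/(k + rank), summed over the runs it appears in, with k commonly set to 60. A sketch that fuses ranked lists of topic IDs:

```python
# Reciprocal Rank Fusion over several ranked lists of topic IDs.
def rrf(rankings: list, k: int = 60) -> list:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A topic that appears in several runs accumulates score, so it outranks a topic that tops only one run.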
Step 4: Follow-up questions (“give example”) without brittle rewriting
Background: rewriting is useful, but can be harmful if always-on
Some frameworks implement "condense question" as a mandatory step; user reports describe pain points such as being unable to skip it, rephrased queries surfacing in answers, and added latency. (GitHub)
LlamaIndex documents the same general two-step pattern (condense → query). (LlamaIndex)
For your case (small, structured KB), deterministic follow-up handling is often better.
Minimal conversation state (store server-side)
- `last_topics`: list of topic IDs returned last turn
- `last_operation`: `DEFINE` / `EXAMPLES` / `COMPARE`
- `turn_id` or timestamp (optional)
Follow-up resolution rules (deterministic)
- If the new query contains example intent and no topic:
  - operation = `EXAMPLES`
  - topics = `last_topics` (if 1, easy; if more than 1, return examples for each)
- If the new query contains a topic phrase:
  - ignore `last_topics` and resolve the new topic(s)
- If the new query contains compare intent like `difference` and no explicit topics:
  - if `last_topics` has 2 entries, compare those
  - else ask a targeted clarification (only then)
This mirrors the “carry forward missing terms” philosophy in query resolution work, but implemented with rules (no heavy models). (arXiv)
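These rules translate almost line-for-line into code. A sketch, assuming cue detection and topic linking happen upstream and are passed in as arguments:

```python
# Deterministic follow-up resolution over minimal per-session state.
def resolve_turn(state: dict, new_topics: list,
                 has_example_cue: bool, has_compare_cue: bool) -> dict:
    if new_topics:                       # explicit topics always win
        op = "COMPARE" if has_compare_cue and len(new_topics) >= 2 else "DEFINE"
        return {"operation": op, "topics": new_topics}
    if has_example_cue:                  # "give example" → reuse last topics
        return {"operation": "EXAMPLES", "topics": state.get("last_topics", [])}
    if has_compare_cue and len(state.get("last_topics", [])) == 2:
        return {"operation": "COMPARE", "topics": state["last_topics"]}
    return {"operation": "CLARIFY", "topics": []}   # only now ask the user

state = {"last_topics": ["simple_past"], "last_operation": "DEFINE"}
followup = resolve_turn(state, new_topics=[],
                        has_example_cue=True, has_compare_cue=False)
```

The clarification branch is the last resort, which is exactly what keeps short follow-ups from being rejected.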
Step 5: Fix session memory (do not key by client IP)
Keying memory by client IP is fragile (NAT/shared IPs, changing IPs, proxies) and can mix user sessions.
Use:
- a random session ID in a cookie, or Flask’s session mechanism
- server-side storage for conversation state (Redis / database) if you scale
Also set secure cookie attributes:
Flask documents Secure, HttpOnly, SameSite and shows how to set them on the session cookie. (Flask)
OWASP’s session management guidance emphasizes secure handling of session identifiers and cookie protections like HttpOnly. (OWASP Cheat Sheet Series)
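In Flask these are plain configuration values; a minimal fragment (the secret key is a placeholder):

```python
from flask import Flask

app = Flask(__name__)
app.secret_key = "change-me"            # use a real random secret in production
app.config.update(
    SESSION_COOKIE_SECURE=True,         # only send the cookie over HTTPS
    SESSION_COOKIE_HTTPONLY=True,       # not readable from JavaScript
    SESSION_COOKIE_SAMESITE="Lax",      # basic CSRF mitigation
)
```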
A concrete pipeline (putting it all together)
1) Normalize + typo repair
- lowercase
- strip extra punctuation
- SymSpell-correct selected tokens (domain dictionary)
2) Detect the operation (DEFINE / EXAMPLES / COMPARE / MULTI)
3) Resolve topics (always attempt)
- exact/alias → fuzzy → embeddings → TF-IDF
- greedy longest-match to prefer specific tenses
4) Enforce multi-topic outputs
- MULTI: require `len(topics) >= 2` (rescue pass if needed)
- COMPARE: require exactly 2 topics
5) Render deterministically from JSON
- DEFINE: definition + form + 1–2 examples
- EXAMPLES: list examples
- COMPARE: 2 definitions + a comparison table (time_reference/form/common_uses/signal_words)
Testing and evaluation (what to measure so it doesn’t regress)
Even though your domain is grammar tenses, you can borrow evaluation ideas from conversational retrieval benchmarks:
- TREC CAsT track overviews discuss conversational query reformulation and show that manual resolution often beats automatic, motivating careful handling of underspecified turns. (arXiv)
- CANARD formalizes “question-in-context rewriting” and highlights ellipsis/coreference as the core difficulty. (ACL Anthology)
Practical regression suite (high impact)
Create a small YAML/JSON test set:
- short queries: `simple past tense`, `present perfect`
- follow-ups: `give example` after each topic
- multi-topic: `present and future tense`
- compare: `difference between past and present tense`
- typos: `tesne`, `prsnt`, `contnuous`
- mixed: `difference between simple past and past continuous`
Track:
- detected operation
- extracted topics (IDs)
- number of returned sections
- confidence scores
Common pitfalls to avoid (directly tied to your symptoms)
- Intent gate before topic linking → rejects short valid queries
- Top-1 retrieval for multi-topic → collapses to one answer (must fan out) (Haystack)
- Always-on rewrite/condense → can distort short follow-ups and add latency; many users try to skip/customize it (GitHub)
- No longest-match preference → `past tense` steals matches from `simple past tense`
- IP-based session memory → mixes users; switch to real session IDs and secure cookie settings (Flask)
Recommended minimal dependency set (lightweight)
- `rapidfuzz` for fuzzy topic/alias matching (RapidFuzz)
- `symspellpy` (SymSpell) for fast typo correction (GitHub)
- your current embedding/LLM semantic matching + TF-IDF fallback
- Flask session configuration with secure cookie flags (Flask)