This seems to make sense, but what are the pitfalls?
Framing your task correctly
What you call “question completion” is usually treated in the IR/RAG literature as question-in-context rewriting / conversational query rewriting (CQR):
- Input: conversation history + current underspecified question
- Output: a standalone question/query that can be sent to your retriever
This is the exact task defined by CANARD (“rewrite a context-dependent question into a self-contained question with the same answer”). (ACL Anthology)
It’s also a core subtask in QReCC, which evaluates rewriting + retrieval + answering end-to-end. (arXiv)
TREC CAsT baselines and team reports commonly use T5 fine-tuned on CANARD as the rewriting module in a conversational search pipeline. (trec.nist.gov)
That matters because the “best” completion isn’t the most fluent one—it’s the one that improves retrieval and downstream answer quality.
Does your 3-step pipeline make sense?
Yes, as a standard “teacher → synthetic labels → small student” approach
Your plan is a common pattern: use a powerful model to create rewrite labels, then SFT a small seq2seq model to reproduce them cheaply.
There are papers and open projects explicitly doing this for CQR:
- InfoCQR prompts LLMs as rewriters (and editors), then distills to smaller models for efficiency. (Diva Portal)
- SynRewrite (2025) constructs synthetic rewrites with GPT-4o and fine-tunes a Flan-T5 rewriter, then further aligns it using downstream feedback (DPO). (arXiv)
So the direction is reasonable.
The main condition: match what your rewriter will see at runtime
The strongest failure mode is training–inference mismatch: if the teacher uses information (gold answer, gold supporting passages, etc.) that your deployed rewriter won’t have, the student learns to “cheat” by injecting answer-y terms.
The biggest pitfalls in your specific plan
Pitfall 1: “multi-step question chains that lead to a final answer” can accidentally leak the answer
If your constructed chains include later-turn facts too early (or include the final answer text in the context), the teacher will generate rewrites that embed answer-specific entities. That can look great offline but can hurt real usage.
Fix
- When generating labels, only provide exactly the same inputs the deployed rewriter will have (typically: recent dialogue turns + current question). Keep “final answer” out of the teacher input unless you also plan to provide it at runtime.
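To make the no-leakage rule concrete, here is a minimal sketch of assembling the teacher's input from exactly the runtime-available pieces. The function name and prompt template are hypothetical; the point is that prior turns (which the user has already seen) are allowed, while the final gold answer and gold passages never enter the prompt.

```python
def build_rewriter_input(history, question, max_turns=4):
    """Assemble the exact input the deployed rewriter will see:
    recent dialogue turns plus the current question.
    Prior answers are fine (the user saw them at runtime);
    the FINAL gold answer and gold passages are deliberately excluded."""
    turns = history[-max_turns:]
    context = " ".join(f"Q: {q} A: {a}" for q, a in turns)
    return f"rewrite: context: {context} question: {question}"
```

Use the same function for teacher labeling and student inference so the two distributions cannot drift apart.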
Pitfall 2: Book-derived “question chains” may not resemble real user follow-ups
Books are good domain text, but they are a weak proxy for conversational phenomena (pronouns, ellipsis, topic drift, corrections). CANARD/QReCC are valuable partly because they represent these phenomena explicitly. (ACL Anthology)
Fix
- Use your historical Q&A to anchor conversational structure.
- Use books more for domain adaptation (see DAPT/TAPT below), terminology coverage, and retrieval corpus—not as the primary generator of dialogue dynamics.
Pitfall 3: Optimizing for “nice” rewrites instead of retrieval-effective rewrites
Human rewrites can omit retrieval-critical context; LLM rewrites can be more informative and improve retrieval. InfoCQR is explicitly motivated by this mismatch and proposes “informative” rewrites plus distillation. (Diva Portal)
Fix
- Select/reward rewrites based on retrieval metrics (Recall@k / MRR / nDCG), not just text similarity.
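These metrics are simple to compute over ranked chunk IDs; a minimal sketch (standard definitions of Recall@k and MRR, nothing model-specific):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```

Scoring every candidate rewrite with these, against your actual retriever, is what turns “nice” rewrites into retrieval-effective ones.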
Pitfall 4: Hallucinated specificity from the teacher becomes your student’s behavior
LLMs “helpfully” add constraints not justified by context.
Fix
- Generate K candidates per instance and filter.
- Add explicit constraints to the teacher prompt: no new entities unless present in the given context; preserve intent; keep it short.
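A cheap programmatic filter can back up the prompt constraints. The sketch below is a crude lexical heuristic, not what any of the cited papers do: it flags capitalized tokens or numbers in a candidate that never appear in the given context (a real pipeline would likely use NER; note it can over-flag sentence-initial capitals).

```python
import re

def introduces_new_entities(candidate, context):
    """Heuristic hallucination check: flag candidate rewrites containing
    capitalized words or numbers absent from the context. Crude proxy --
    sentence-initial capitals can be over-flagged, and a production
    pipeline would use a proper NER tagger instead."""
    context_tokens = {t.lower() for t in re.findall(r"\w+", context)}
    suspects = re.findall(r"\b(?:[A-Z][a-z]+|\d+)\b", candidate)
    return any(t.lower() not in context_tokens for t in suspects)
```

Candidates that trip the filter are dropped before the retrieval-feedback step, so hallucinated specificity never becomes a training label.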
Pitfall 5: T5-small may be harder to steer than you expect
For instruction-like behavior (“rewrite into a standalone question”), an instruction-tuned checkpoint is often easier to fine-tune than vanilla T5.
Fix
- Start from Flan-T5-small (released in the instruction-tuning work “Scaling Instruction-Finetuned Language Models”). (arXiv)
What I would do instead (a robust version of your pipeline)
Step 0 — Define the contract and add “NO_REWRITE”
Make your model learn:
- If the question is already standalone → output unchanged.
- If the context is insufficient → output unchanged (or a special token you handle upstream).
This matches how CQR datasets treat self-contained turns (rewrite can equal original). (assets.amazon.science)
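The contract can be encoded directly in how you build training targets. A minimal sketch, with a hypothetical sentinel token (your upstream handler decides what to do with it):

```python
NO_REWRITE = "<NO_REWRITE>"  # hypothetical sentinel, handled upstream

def make_target(original_question, teacher_rewrite,
                is_standalone, context_sufficient):
    """Training-label contract: the student must learn when NOT to rewrite.
    Standalone questions map back to themselves; insufficient context maps
    to a sentinel (or, alternatively, also back to the original)."""
    if not context_sufficient:
        return NO_REWRITE
    if is_standalone:
        return original_question
    return teacher_rewrite
```

Including plenty of these identity/sentinel examples in SFT is what stops the student from compulsively “improving” questions that were already fine.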
Step 1 — Build an evaluation set before scaling synthetic data
Create a small dev/test set of real examples from your historical Q&A:
- history + follow-up + gold rewrite
- and ideally “known relevant chunk(s)” for retrieval evaluation (even partial)
Use CANARD/QReCC/TREC CAsT concepts as templates for what “good” looks like. (Google Sites)
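One way to pin down the eval format early is a small schema mirroring the CANARD/QReCC structure; the field names here are hypothetical, and `relevant_chunk_ids` is allowed to be partial or empty:

```python
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    """One dev/test instance for the rewriter.
    relevant_chunk_ids enables retrieval metrics and may be partial."""
    history: list                      # [(question, answer), ...] prior turns
    question: str                      # current under-specified follow-up
    gold_rewrite: str                  # human-written standalone rewrite
    relevant_chunk_ids: list = field(default_factory=list)
```

Freezing this schema before generating synthetic data keeps the eval set, teacher inputs, and student inputs structurally identical.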
Step 2 — Domain-adaptive pretraining on books (optional but high leverage)
Before rewrite SFT, run domain-adaptive pretraining (DAPT) on your book corpus. “Don’t Stop Pretraining” shows that continuing pretraining on in-domain data improves downstream performance in both high- and low-resource settings. (ACL Anthology)
This helps the small model learn domain vocabulary and style even without huge labeled rewrite pairs.
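For T5-family models, DAPT means continuing the span-corruption denoising objective on your books. The sketch below builds one (input, target) pair in that format; it is a simplified, whitespace-token version of T5's objective (the real pipeline works on subword tokens and samples span lengths, and Hugging Face provides proper collators).

```python
import random

def span_corruption_pair(tokens, n_spans=2, span_len=2, rng=None):
    """Build one simplified T5-style denoising example: replace random
    non-overlapping spans with <extra_id_i> sentinels in the input; the
    target lists each sentinel followed by the dropped span."""
    rng = rng or random.Random(0)
    starts = sorted(rng.sample(range(len(tokens) - span_len), n_spans))
    chosen, last_end = [], -1
    for s in starts:                       # greedily drop overlapping spans
        if s > last_end:
            chosen.append(s)
            last_end = s + span_len - 1
    inp, tgt, i, sid = [], [], 0, 0
    while i < len(tokens):
        if sid < len(chosen) and i == chosen[sid]:
            inp.append(f"<extra_id_{sid}>")
            tgt.append(f"<extra_id_{sid}>")
            tgt.extend(tokens[i:i + span_len])
            i += span_len
            sid += 1
        else:
            inp.append(tokens[i])
            i += 1
    return " ".join(inp), " ".join(tgt)
```

Running ordinary seq2seq training on pairs like these over the book corpus is all DAPT is; no rewrite labels are involved at this stage.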
Step 3 — Teacher generates multiple rewrite candidates (rewrite + edit)
Generate K candidates (e.g., 5–10). Use an “editor pass” or strict prompt rules similar to InfoCQR’s approach (LLM as rewriter + editor; distillation to small model). (Diva Portal)
Step 4 — Retrieval-feedback filtering (most important upgrade)
For each candidate rewrite:
- Run your actual retriever (BM25/dense/hybrid).
- Score candidates by retrieval success (Recall@k, nDCG@k, etc.).
- Keep the best rewrite as the training label.
This aligns training with the real objective used in conversational search pipelines (CAsT). (trec.nist.gov)
If you want an “academic-style” version of retrieval-guided refinement, see GuideCQR, which uses initially retrieved documents to refine reformulations. (arXiv)
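The filtering loop itself is short. A sketch assuming only that you have some `retrieve(query) -> ranked chunk IDs` function (BM25, dense, or hybrid) and partial relevance labels; scoring by Recall@k with reciprocal-rank tie-breaking:

```python
def pick_best_rewrite(candidates, retrieve, relevant_ids, k=10):
    """Retrieval-feedback label selection: run each candidate rewrite
    through the live retriever and keep the one that retrieves the
    known-relevant chunks best (Recall@k, then reciprocal rank)."""
    def score(cand):
        ranked = retrieve(cand)            # your BM25 / dense / hybrid retriever
        hits = set(ranked[:k]) & set(relevant_ids)
        recall = len(hits) / max(len(relevant_ids), 1)
        rr = next((1.0 / (r + 1) for r, d in enumerate(ranked)
                   if d in relevant_ids), 0.0)
        return (recall, rr)
    return max(candidates, key=score)
```

The winning candidate becomes the SFT target for that instance; instances where every candidate scores zero are worth inspecting rather than training on.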
Step 5 — Train the student
Fine-tune Flan-T5-small on the filtered (runtime input → best rewrite) pairs, including the identity/NO_REWRITE examples from Step 0, and evaluate on the Step 1 dev set with both text-similarity and retrieval metrics.
Two strong alternatives (or complements) to generative rewriting
Option A: Term-based “query resolution” instead of full generation
QuReTeC models query resolution as binary term classification: decide which terms from conversation history should be added to the current query. It’s designed for limited supervision and was evaluated in TREC CAsT-style settings. (arXiv)
This often works well in domains where exact entity strings matter.
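To show the shape of term-based resolution: QuReTeC trains a BERT-based binary tagger over history terms, but the decision it makes can be sketched with a heuristic stand-in (keep non-stopword history terms missing from the current query). The stopword list and function name are illustrative only.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "who", "what", "was", "is", "it"}

def resolve_query_terms(history_text, current_query):
    """Heuristic stand-in for QuReTeC-style query resolution: append
    history terms judged relevant (here: non-stopwords absent from the
    current query) instead of generating a full rewrite. QuReTeC itself
    learns this add/skip decision with a trained classifier."""
    current = {t.lower() for t in re.findall(r"\w+", current_query)}
    seen, expansion = set(), []
    for t in re.findall(r"\w+", history_text):
        low = t.lower()
        if low not in STOPWORDS and low not in current and low not in seen:
            seen.add(low)
            expansion.append(t)
    if not expansion:
        return current_query
    return current_query + " " + " ".join(expansion)
```

Because the expansion copies entity strings verbatim from the history, exact domain terminology survives untouched, which is the main appeal of this family of methods.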
Option B: “Minimal edits” rewriting (avoid over-paraphrasing)
CRDR (Explicit Query Rewriting for Conversational Dense Retrieval) emphasizes making only necessary modifications rather than generating an entirely new query. (ACL Anthology)
Deployment tricks that often matter more than model size
Multi-rewrite retrieval
Generate several rewrites (beam search), retrieve for each, then merge/rerank. A SIGIR 2024 paper explores generating multiple rewrites and incorporating them into retrieval. (ACM Digital Library)
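A standard way to merge the per-rewrite result lists is Reciprocal Rank Fusion (RRF); the cited paper may use a different combination scheme, so treat this as one reasonable default:

```python
def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(d) = sum over result lists of
    1 / (k + rank of d in that list); return the top_n fused IDs.
    k=60 is the conventional default from the RRF literature."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [d for d, _ in sorted(scores.items(), key=lambda x: -x[1])][:top_n]
```

Documents retrieved by several rewrites accumulate score across lists, so consensus hits rise to the top even when no single rewrite was perfect.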
Improve the history before rewriting
If ambiguity is caused by messy history, CHIQ proposes refining the conversation history (e.g., coreference/expansion) before rewriting, and reports strong results on multiple benchmarks. (arXiv)
Bottom line for your pipeline
Your approach is reasonable provided you tighten it into “CQR for retrieval” and add:
- No answer/gold leakage in teacher inputs (match runtime inputs)
- Retrieval-feedback filtering of teacher rewrites
- Start from Flan-T5-small, optionally preceded by domain-adaptive pretraining on your books (arXiv)
- Consider a hybrid with QuReTeC (term expansion) when exact domain strings are important (arXiv)