Using SFT on T5-small for domain-aware question completion—does this pipeline make sense?

I’m planning to use a small model in my RAG system to perform question completion based on historical Q&A. My current approach is as follows:

  1. Clean domain-specific book data to construct a dataset containing multiple questions that lead to a final answer;

  2. Use a powerful model to complete the questions in this dataset;

  3. Combine this completed dataset with a small amount of general-domain (non-domain-specific) question-completion data and fine-tune T5-small via supervised fine-tuning (SFT).

The goal is to obtain a small model specialized for question completion in this specific domain. Does this approach sound reasonable? Any potential pitfalls or suggestions for improvement?


It seems to make sense, but there are pitfalls. Here is a breakdown.


Framing your task correctly

What you call “question completion” is usually treated in the IR/RAG literature as question-in-context rewriting / conversational query rewriting (CQR):

  • Input: conversation history + current underspecified question
  • Output: a standalone question/query that can be sent to your retriever
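Concretely, the data shape looks something like this (a toy sketch; the field names and the `|||` separator are my own conventions, not from any dataset):

```python
# One CQR training instance: dialogue history + underspecified follow-up -> standalone rewrite.
example = {
    "history": [
        "Q: Who wrote The Hobbit?",
        "A: J. R. R. Tolkien.",
    ],
    "question": "When did he die?",                # underspecified: "he"
    "rewrite": "When did J. R. R. Tolkien die?",   # self-contained query for the retriever
}

def build_rewriter_input(history, question):
    """Flatten history turns and the current question into one seq2seq input string."""
    return " ".join(history) + " ||| " + question

model_input = build_rewriter_input(example["history"], example["question"])
```

The model is trained to map `model_input` to `example["rewrite"]`; the retriever then only ever sees the rewrite.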

This is the exact task defined by CANARD (“rewrite a context-dependent question into a self-contained question with the same answer”). (ACL Anthology)
It’s also a core subtask in QReCC, which evaluates rewriting + retrieval + answering end-to-end. (arXiv)
TREC CAsT baselines and team reports commonly use T5 fine-tuned on CANARD as the rewriting module in a conversational search pipeline. (trec.nist.gov)

That matters because the “best” completion isn’t the most fluent one—it’s the one that improves retrieval and downstream answer quality.


Does your 3-step pipeline make sense?

Yes, as a standard “teacher → synthetic labels → small student” approach

Your plan is a common pattern: use a powerful model to create rewrite labels, then SFT a small seq2seq model to reproduce them cheaply.

There are papers and open projects explicitly doing this for CQR:

  • InfoCQR prompts LLMs as rewriters (and editors), then distills to smaller models for efficiency. (Diva Portal)
  • SynRewrite (2025) constructs synthetic rewrites with GPT-4o and fine-tunes a Flan-T5 rewriter, then further aligns it using downstream feedback (DPO). (arXiv)

So the direction is reasonable.

The main condition: match what your rewriter will see at runtime

The strongest failure mode is training–inference mismatch: if the teacher uses information (gold answer, gold supporting passages, etc.) that your deployed rewriter won’t have, the student learns to “cheat” by injecting answer-y terms.


The biggest pitfalls in your specific plan

Pitfall 1: “multiple questions that lead to a final answer” can accidentally leak the answer

If your constructed chains include later-turn facts too early (or include the final answer text in the context), the teacher will generate rewrites that embed answer-specific entities. That can look great offline but can hurt real usage.

Fix

  • When generating labels, only provide exactly the same inputs the deployed rewriter will have (typically: recent dialogue turns + current question). Keep “final answer” out of the teacher input unless you also plan to provide it at runtime.
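A minimal sketch of that discipline, assuming the deployed rewriter sees only the last few turns plus the current question (the helper name and turn window are my own):

```python
def build_teacher_input(turns, current_question, max_turns=4):
    """Give the teacher EXACTLY what the deployed rewriter will see at runtime:
    the last few dialogue turns plus the current question. The gold final
    answer and gold passages are deliberately never part of this string,
    so the teacher cannot embed answer-specific terms into the label."""
    recent = turns[-max_turns:]
    return "\n".join(recent) + "\n" + current_question

turns = [
    "Q: What is domain-adaptive pretraining?",
    "A: Continued pretraining on in-domain text.",
]
prompt = build_teacher_input(turns, "Q: Does it help small models?")
```

If you ever do want answer-aware teaching, you must also supply that signal at runtime, which usually defeats the purpose of a rewriter.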

Pitfall 2: Book-derived “question chains” may not resemble real user follow-ups

Books are good domain text, but they are a weak proxy for conversational phenomena (pronouns, ellipsis, topic drift, corrections). CANARD/QReCC are valuable partly because they represent these phenomena explicitly. (ACL Anthology)

Fix

  • Use your historical Q&A to anchor conversational structure.
  • Use books more for domain adaptation (see DAPT/TAPT below), terminology coverage, and retrieval corpus—not as the primary generator of dialogue dynamics.

Pitfall 3: Optimizing for “nice” rewrites instead of retrieval-effective rewrites

Human rewrites can omit retrieval-critical context; LLM rewrites can be more informative and improve retrieval. InfoCQR is explicitly motivated by this mismatch and proposes “informative” rewrites plus distillation. (Diva Portal)

Fix

  • Select/reward rewrites based on retrieval metrics (Recall@k / MRR / nDCG), not just text similarity.
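The two most common scoring functions are a few lines each; a self-contained sketch (standard definitions, nothing project-specific):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Score each candidate rewrite by running your actual retriever on it and computing these against your (possibly partial) relevance labels.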

Pitfall 4: Hallucinated specificity from the teacher becomes your student’s behavior

LLMs “helpfully” add constraints not justified by context.

Fix

  • Generate K candidates per instance and filter.
  • Add explicit constraints to the teacher prompt: no new entities unless present in the given context; preserve intent; keep it short.
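A cheap automated backstop for the "no new entities" rule is a token-overlap filter: reject any candidate whose content words appear in neither the context nor the original question. A toy sketch (the stopword list and regex tokenizer are crude placeholders for whatever NER/tokenization you actually use):

```python
import re

STOP = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "to",
        "and", "or", "what", "who", "when", "where", "how", "did", "does", "do"}

def introduces_new_terms(rewrite, context, question):
    """Return True if the candidate rewrite contains content tokens that appear
    neither in the dialogue context nor in the original question - a cheap
    proxy for hallucinated specificity from the teacher."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    allowed = tokenize(context) | tokenize(question) | STOP
    return bool(tokenize(rewrite) - allowed)
```

Candidates flagged by this check go back to the teacher for regeneration or are simply dropped before SFT.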

Pitfall 5: T5-small may be harder to steer than you expect

For instruction-like behavior (“rewrite into a standalone question”), an instruction-tuned checkpoint is often easier to fine-tune than vanilla T5.

Fix

  • Start from Flan-T5-small (released in the instruction-tuning work “Scaling Instruction-Finetuned Language Models”). (arXiv)

What I would do instead (a robust version of your pipeline)

Step 0 — Define the contract and add “NO_REWRITE”

Make your model learn:

  • If the question is already standalone → output unchanged.
  • If the context is insufficient → output unchanged (or a special token you handle upstream).

This matches how CQR datasets treat self-contained turns (rewrite can equal original). (assets.amazon.science)
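The upstream handling of that contract is trivial but worth pinning down; a sketch assuming a `NO_REWRITE` sentinel string (my own convention, not from any dataset):

```python
NO_REWRITE = "NO_REWRITE"  # assumed sentinel emitted by the rewriter when it declines

def resolve_query(original_question, model_output):
    """Pass the rewrite through to retrieval, but fall back to the original
    question when the rewriter declines (sentinel) or produces nothing,
    so retrieval always receives a usable query."""
    out = model_output.strip()
    if not out or out == NO_REWRITE:
        return original_question
    return out
```

Training a few "already standalone -> output unchanged" pairs alongside this keeps the model from rewriting for the sake of rewriting.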


Step 1 — Build an evaluation set before scaling synthetic data

Create a small dev/test set of real examples from your historical Q&A:

  • history + follow-up + gold rewrite
  • and ideally “known relevant chunk(s)” for retrieval evaluation (even partial)

Use CANARD/QReCC/TREC CAsT concepts as templates for what “good” looks like. (Google Sites)


Step 2 — Domain-adaptive pretraining on books (optional but high leverage)

Before rewrite SFT, run domain-adaptive pretraining (DAPT) on your book corpus. “Don’t Stop Pretraining” shows that continuing pretraining on in-domain data improves downstream performance in both high- and low-resource settings. (ACL Anthology)

This helps the small model learn domain vocabulary and style even without huge labeled rewrite pairs.


Step 3 — Teacher generates multiple rewrite candidates (rewrite + edit)

Generate K candidates (e.g., 5–10). Use an “editor pass” or strict prompt rules similar to InfoCQR’s approach (LLM as rewriter + editor; distillation to small model). (Diva Portal)


Step 4 — Retrieval-feedback filtering (most important upgrade)

For each candidate rewrite:

  1. Run your actual retriever (BM25/dense/hybrid).
  2. Score candidates by retrieval success (Recall@k, nDCG@k, etc.).
  3. Keep the best rewrite as the training label.

This aligns training with the real objective used in conversational search pipelines (CAsT). (trec.nist.gov)
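The selection loop above can be sketched in a few lines; `retrieve` and `score` are injected so it works with any retriever and any metric (both names are my own):

```python
def pick_best_rewrite(candidates, retrieve, relevant_ids, score, k=10):
    """Run the real retriever on each candidate rewrite and keep the one with
    the best retrieval score as the training label.
    retrieve(query) -> ranked list of doc ids
    score(retrieved, relevant, k) -> float, higher is better."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        s = score(retrieve(cand), relevant_ids, k)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```

Instances where even the best candidate scores near zero are worth dropping entirely rather than training on a bad label.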

If you want an “academic-style” version of retrieval-guided refinement, see GuideCQR, which uses initially retrieved documents to refine reformulations. (arXiv)


Step 5 — Train the student

  • Base: Flan-T5-small (arXiv)

  • Training mix:

    • retrieval-filtered domain synthetic
    • a small amount of general CQR data (CANARD/QReCC-style) to keep robustness (ACL Anthology)
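Assembling that mix is mostly a sampling decision; a sketch where `general_fraction` (my own knob) controls how much general CQR data is blended in relative to the domain set:

```python
import random

def build_training_mix(domain_pairs, general_pairs, general_fraction=0.2, seed=0):
    """Combine retrieval-filtered domain rewrites with a slice of general
    CQR pairs (CANARD/QReCC-style) so the student keeps generic rewriting
    ability instead of overfitting to domain phrasing."""
    rng = random.Random(seed)
    n_general = min(len(general_pairs), int(len(domain_pairs) * general_fraction))
    mix = list(domain_pairs) + rng.sample(list(general_pairs), n_general)
    rng.shuffle(mix)
    return mix
```

The right ratio is an empirical question; sweep it against the dev set from Step 1 rather than guessing.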

Two strong alternatives (or complements) to generative rewriting

Option A: Term-based “query resolution” instead of full generation

QuReTeC models query resolution as binary term classification: decide which terms from conversation history should be added to the current query. It’s designed for limited supervision and was evaluated in TREC CAsT-style settings. (arXiv)

This often works well in domains where exact entity strings matter.

Option B: “Minimal edits” rewriting (avoid over-paraphrasing)

CRDR (Explicit Query Rewriting for Conversational Dense Retrieval) emphasizes making only necessary modifications rather than generating an entirely new query. (ACL Anthology)


Deployment tricks that often matter more than model size

Multi-rewrite retrieval

Generate several rewrites (beam search), retrieve for each, then merge/rerank. A SIGIR 2024 paper explores generating multiple rewrites and incorporating them into retrieval. (ACM Digital Library)
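One simple, widely used merge strategy for the per-rewrite result lists is reciprocal rank fusion (RRF); a self-contained sketch with the conventional `k=60` damping constant:

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal rank fusion: each doc scores sum(1 / (k + rank)) across
    the ranked lists produced by the different rewrites. Docs retrieved by
    several rewrites bubble to the top without any score calibration."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no tuning beyond `k` and is robust to one bad rewrite, which is exactly the failure mode multi-rewrite retrieval is hedging against.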

Improve the history before rewriting

If ambiguity is caused by messy history, CHIQ proposes refining the conversation history (e.g., coreference/expansion) before rewriting, and reports strong results on multiple benchmarks. (arXiv)


Bottom line for your pipeline

Your approach is reasonable if you frame it as “CQR for retrieval” and add:

  1. No answer/gold leakage in teacher inputs (match runtime inputs)
  2. Retrieval-feedback filtering of teacher rewrites
  3. Start from Flan-T5-small, optionally preceded by domain-adaptive pretraining on your books (arXiv)
  4. Consider a hybrid with QuReTeC (term expansion) when exact domain strings are important (arXiv)