Need advice: open-source surgical LLM fine-tune (90k Q&A) — multi-turn stability, RL (DPO), and RAG

I’m planning to fine-tune gpt-oss-120b (or Qwen3-30B-A3B-Thinking-2507) on a mixed corpus: ~10k human-written Q&A pairs plus ~80k carefully curated synthetic Q&A pairs that we spent a few months generating and validating. The goal is to publish an open-weight model on Hugging Face and submit the work to an upcoming surgical conference in my country. The model is intended to help junior surgeons with clinical reasoning/support and board-style exam prep.

I’m very comfortable with RAG + inference/deployment, but this is my first time running a fine-tuning effort at this scale. I’m also working with a tight compute budget, so I’m trying to be deliberate and avoid expensive trial-and-error. I’d really appreciate input from anyone who’s done this in practice:

  1. Multi-turn behavior: If I fine-tune on this dataset, will it noticeably degrade multi-turn / follow-up handling? Should I explicitly add another 5–10k dialog-style, multi-turn examples (with coreference + follow-ups), or will the base model generally preserve conversational robustness without increased hallucination?

  2. SFT vs RL: The dataset is ~25% MCQs and ~75% open-ended answers; MCQs include rationales/explanations. Would you recommend RL after SFT here? If yes, what approach makes the most sense (e.g., DPO/IPO/KTO/ORPO vs PPO-style RLHF), and what data format + rough scale would you target for the preference/reward step?

  3. Two inference modes: I want two user-facing modes: clinical support and exam preparation. Would you bake the mode-specific system prompts into SFT/RL (i.e., train with explicit instruction headers), and if so, would you attach them to every example or only a subset to avoid over-conditioning?

  4. RAG / tool use at inference: If I’m going to pair the model with RAG and/or a web-search tool at inference time, should that change how I structure fine-tuning or RL? For example: training with retrieved context, citations, tool-call patterns, refusal policies, or “answer only from context” constraints.

  5. Model choice: Between gpt-oss-20b and Qwen3-30B-A3B-Thinking-2507, which would you pick for this use case? I slightly prefer gpt-oss-20b for general non-coding performance, but I’m unsure whether its Harmony chat formatting or any architecture/format constraints create extra friction during SFT/RL.


Since the target domain is high-stakes, please also try Hugging Science if possible.
The detailed version is here.


Below is a practical plan that minimizes expensive trial-and-error, while protecting multi-turn behavior and making RL and RAG “worth it” for a tight compute budget.

You are in a high-stakes domain. Even if your intent is exam prep and reasoning support, assume people will try to use it clinically. Design the training and the release accordingly (clear scope, uncertainty behavior, refusal and escalation patterns).


0. First decision: what are you actually trying to change?

Think of three layers:

  1. Domain knowledge + domain style (surgery Q&A, board-style reasoning).
  2. Conversation skill (follow-ups, coreference, correction turns).
  3. Reliability constraints (don’t hallucinate, be explicit about uncertainty, cite evidence when provided).

Your 90k Q&A mostly trains layer 1. Layers 2 and 3 do not automatically “come along for free” unless you explicitly include them in the objective and data.


1. Multi-turn behavior: will SFT on mostly single-turn Q&A degrade it?

It can, depending on how narrow your fine-tuning distribution is and how aggressively you update weights.

Two key points from the literature that match practice:

  • Sequential or narrow post-training can cause capability conflicts and forgetting, especially when the training distribution shifts away from general dialogue. Work analyzing post-training shows sequential training can induce catastrophic forgetting, and mixing tasks/data helps balance general vs specialized abilities. (ACL Anthology)
  • Multi-turn ability is a distinct skill. There are dedicated multi-turn tuning datasets and methods that show measurable gains on multi-turn evaluation (example: Parrot-40K and MT-Bench++-style evaluation). (ACL Anthology)

Recommendation

Yes, add multi-turn data. Do it cheaply and deliberately.

Target: 5k to 15k multi-turn conversations, 3 to 8 turns each. This is enough to “anchor” multi-turn behaviors without bloating compute.

What to include (high leverage)

  • Follow-up coreference: “Given that, what now?”
  • User corrections: “No, the patient is on warfarin.”
  • Missing-info prompts: model asks 1 to 3 clarifying questions before committing.
  • Contradiction handling: “Earlier you said X, now Y. Resolve.”
  • Safety turns: “This is urgent, escalate.”

Mixing ratio

  • If you do 90k single-turn + 10k multi-turn, you get ~10% multi-turn by example count, but more by token count (because multi-turn is longer). That is usually sufficient.

Bonus: add a small general multi-turn anchor
If you are worried about general chat degradation, mix a small amount of high-quality general multi-turn (1% to 5% of tokens). UltraChat is an example of a large multi-turn dataset used for chat tuning. (ACL Anthology)
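Because multi-turn conversations are longer, the mixing ratio is best reasoned about in tokens, not examples. A minimal sketch of that bookkeeping, using whitespace splitting as a rough stand-in for the model tokenizer (an assumption; swap in your real tokenizer for actual numbers):

```python
# Rough sketch: estimate the multi-turn share of a data mix by token count.
# Whitespace splitting approximates the tokenizer; use the model's tokenizer
# in a real pipeline.

def approx_tokens(text: str) -> int:
    return len(text.split())

def multi_turn_token_share(single_turn: list[str], multi_turn: list[str]) -> float:
    st = sum(approx_tokens(t) for t in single_turn)
    mt = sum(approx_tokens(t) for t in multi_turn)
    return mt / (st + mt)

# Toy illustration: many short single-turn examples vs one long conversation.
single = ["Q: ... A: short answer."] * 9
multi = ["User: ... Assistant: ... User: follow-up ... Assistant: longer reply ..." * 3]
share = multi_turn_token_share(single, multi)
```

Even at 10% of examples, the multi-turn share of tokens is typically several times higher, which is what actually anchors the behavior.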


2. SFT vs RL: should you do RL after SFT, and which method?

The short practical answer

  • Do SFT first.

  • Then do preference optimization only if you have a clear target behavior that SFT does not reliably produce, like:

    • better uncertainty calibration,
    • safer refusals and escalation,
    • better “answer from context” discipline with RAG,
    • more consistent exam-style explanations,
    • reduced hallucinations in ambiguous prompts.

Why not PPO-style RLHF first?

PPO-style RLHF is expensive and finicky. If compute is tight, start with preference optimization methods that are designed to be simpler.

What method fits your constraints?

  • DPO is a common default because it is explicitly designed to be stable and computationally lightweight relative to RLHF pipelines that train a reward model and run PPO. (arXiv)
  • ORPO is attractive if you want “SFT + preference” in a single stage, avoiding a separate reference model step. (arXiv)
  • KTO is an alternative objective motivated by prospect theory, and is often discussed as a way to align without the exact same pairwise preference setup. (arXiv)

Data format and scale for preference tuning

You do not need huge preference datasets for a noticeable effect if your rubric is sharp.

Practical target: 5k to 30k preference pairs.

  • Start at 5k to 10k for a first pass.
  • Go larger only if you see consistent failure modes.

How to build preference pairs cheaply
For each prompt:

  • Create 2 to 4 candidate answers (your SFT model + a baseline + a “bad” corrupted version).

  • Label “chosen vs rejected” using a rubric:

    • correctness,
    • unsafe overconfidence,
    • guideline adherence,
    • missing key differentials,
    • for MCQs: correct option selection and rationale quality.

If you already have validated Q&A, you can generate “rejected” answers by controlled corruption (wrong step, omitted contraindication, wrong antibiotic class, etc.). This is often higher leverage than collecting new prompts.
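A minimal sketch of that corruption idea, building DPO-style (prompt, chosen, rejected) records from validated answers. The corruption rules and field names are illustrative assumptions, not a fixed schema, and the sentence-dropping rule assumes multi-sentence answers:

```python
# Sketch: generate "rejected" answers by controlled corruption of validated
# answers, yielding preference pairs in a common DPO-style format.
import random

CORRUPTIONS = [
    lambda a: a.replace("contraindicated", "indicated"),          # flip a safety word
    lambda a: a + " No further workup is needed.",                # unsafe overconfidence
    lambda a: ". ".join(a.split(". ")[:-1]) + ".",                # drop the final step
]

def make_pair(question: str, validated_answer: str, rng: random.Random) -> dict:
    corrupt = rng.choice(CORRUPTIONS)
    return {
        "prompt": question,
        "chosen": validated_answer,
        "rejected": corrupt(validated_answer),
    }

rng = random.Random(0)
pair = make_pair(
    "Preop management for a patient on warfarin?",
    "Stop warfarin 5 days before surgery. Bridging is contraindicated here. Recheck INR.",
    rng,
)
```

Each corruption targets one rubric item (safety, overconfidence, missing step), so the preference signal stays interpretable.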

MCQs vs open-ended in RL

For MCQs, preference tuning can reward:

  • selecting the right option,
  • giving a concise rationale,
  • not inventing facts not in the stem.

For open-ended, preference tuning can reward:

  • structured differential,
  • “if missing info, ask”,
  • stating uncertainty.

3. Two inference modes: bake prompts into training or not?

You want two modes:

  • clinical support (decision support, reasoning, caution),
  • exam prep (board-style, didactic).

Recommendation

Yes, train the model to recognize an explicit mode control.

Do this with a short mode tag in the system message or a dedicated control field. The key is consistency.

Best practice

  • Include the mode indicator in most examples (I would do 80% to 100%).
  • Vary the wording slightly so the model learns the concept, not a single magic string.
  • Keep the control lightweight: “Mode: EXAM” or “Mode: CLINICAL”.

Over-conditioning is less of a problem than under-conditioning because at inference you will also provide a system prompt. If the mode tag is always present during training, it becomes a reliable switch.
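A minimal data-prep sketch for the mode control, varying the phrasing so the model learns the concept rather than one magic string. The phrasings and the message schema are illustrative assumptions:

```python
# Sketch: attach a lightweight mode tag to the system turn during data prep.
import random

MODE_PHRASINGS = {
    "EXAM": ["Mode: EXAM", "You are in exam-prep mode.", "mode=exam"],
    "CLINICAL": ["Mode: CLINICAL", "You are in clinical-support mode.", "mode=clinical"],
}

def with_mode(messages: list[dict], mode: str, rng: random.Random) -> list[dict]:
    tag = rng.choice(MODE_PHRASINGS[mode])
    # Prepend a system turn; merge into an existing one if the chat template
    # allows only a single system message.
    if messages and messages[0]["role"] == "system":
        merged = dict(messages[0])
        merged["content"] = tag + "\n" + merged["content"]
        return [merged] + messages[1:]
    return [{"role": "system", "content": tag}] + messages

rng = random.Random(0)
convo = [{"role": "user", "content": "Best next step for suspected appendicitis?"}]
tagged = with_mode(convo, "CLINICAL", rng)
```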


4. If you will use RAG and tools at inference, should training change?

Yes, if you care about attribution discipline and tool use reliability.

There are two separate skills:

  1. Using provided context faithfully (do not answer beyond it when instructed).
  2. Deciding when to call a tool (search, guidelines lookup).

Training with retrieved context and citations

If you want “answer with citations” or “answer only from context,” you should train that explicitly.

Research on attributable generation shows you can train models to produce better supported citations using reward signals targeted at citation precision and recall, and that training can materially improve attribution behavior beyond prompting alone. (arXiv)

Practical recipe

  • Create a subset of your dataset (say 20% to 40%) where each example includes:

    • question,
    • retrieved snippets (guidelines, textbook extracts, your curated references),
    • required output format with citations.
  • Include “insufficient context” cases where the correct answer is:

    • “Not enough evidence in the provided text. Ask for X or retrieve Y.”

This single change often reduces hallucination more than any RL step.

Training tool-call patterns

If you want the model to reliably call a web-search or retrieval tool, you need demonstrations of “reason → tool call → incorporate result”.

Two canonical references:

  • ReAct: interleaves reasoning and actions, and reduces hallucination by fetching evidence when needed. (arXiv)
  • Toolformer: trains a model to decide which APIs to call and how to use outputs, using self-supervision. (arXiv)

Practical recommendation

  • Add a small number of tool-use exemplars (hundreds to a few thousand) if you want the model itself to initiate retrieval.
  • If your RAG pipeline always retrieves automatically, then you can skip tool-call training and focus on “use context faithfully.”
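A sketch of what one ReAct-style "reason → tool call → incorporate result" exemplar might look like as a message list. The tool name (guideline_search), the JSON argument shape, and the role names are assumptions; align them with your serving stack's actual tool-call format:

```python
# Sketch: a single tool-use demonstration trace for SFT, ReAct-style.
import json

def tool_use_exemplar() -> list[dict]:
    return [
        {"role": "user",
         "content": "Latest recommendation on VTE prophylaxis duration after major abdominal surgery?"},
        {"role": "assistant",
         "content": "This may have changed recently; I should check the guidelines.",
         "tool_call": {"name": "guideline_search",
                       "arguments": json.dumps({"query": "extended VTE prophylaxis abdominal surgery"})}},
        {"role": "tool", "name": "guideline_search",
         "content": "Guideline excerpt: extended prophylaxis up to 28 days is recommended for high-risk patients."},
        {"role": "assistant",
         "content": "Per the retrieved guideline, extended prophylaxis up to 28 days is recommended for high-risk patients."},
    ]

trace = tool_use_exemplar()
```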

5. Model choice: gpt-oss-20b or Qwen3-30B-A3B-Thinking-2507?

Key constraints you flagged

  • tight compute,
  • open-weight HF release,
  • possible formatting friction.

gpt-oss-20b / gpt-oss-120b facts that matter

  • OpenAI’s open-weight models include gpt-oss-120b and gpt-oss-20b. The 120b is a MoE-style model with 117B parameters and ~5.1B active parameters, and the 20b is ~21B with ~3.6B active. (Hugging Face)
  • They were trained on the Harmony response format and, per the model cards, should only be used with that format to work correctly. (Hugging Face)
  • OpenAI also emphasizes the weights ship natively quantized in MXFP4 (helpful for inference footprint). (OpenAI)

Implication: there is formatting friction. It is manageable, but you must:

  • format your SFT and preference data using the correct chat template,
  • keep your pipeline consistent end-to-end.

Qwen3-30B-A3B-Thinking-2507 facts that matter

  • Qwen3-30B-A3B-Thinking-2507 is MoE-style: ~30.5B total parameters with ~3.3B activated, and it advertises 262,144 native context length. (Hugging Face)

Implication: long context is attractive for RAG-heavy workflows and exam passages. The activated parameter count suggests decent efficiency per token, but memory and training logistics still depend on how you fine-tune.

My pick for your stated setup

  • If you want minimal prompt-format friction and you expect to lean heavily on RAG with long contexts, I would pick Qwen3-30B-A3B-Thinking-2507. The long native context length is a clear practical advantage. (Hugging Face)
  • If you want OpenAI’s open-weight line specifically and you are willing to fully commit to Harmony formatting, then gpt-oss-20b is the more compute-realistic starting point than 120b, and it is explicitly positioned for local or specialized use cases. (Hugging Face)

Given your tight budget and first-time fine-tune, I would not start at 120b unless you already have proven infrastructure for that scale.


A compute-efficient training blueprint (low regret)

  1. Formatting and data QA

    • Convert everything into the model’s native chat template (Harmony for gpt-oss, Qwen chat for Qwen).
    • Strict train/dev split by topic and by source (human vs synthetic) to detect overfitting.
  2. SFT stage

    • Mix: 90k single-turn + 5k to 15k multi-turn.
    • Include mode tag in most examples.
    • Keep epochs low (often 1 to 2 over tokens is enough). Over-training is a common failure mode in domain SFT.
  3. Optional preference stage (DPO or ORPO)

    • 5k to 30k preference pairs.
    • Focus the rubric on the behaviors you cannot get from SFT: uncertainty, refusal, “ask clarifying questions,” context faithfulness.
  4. RAG discipline stage

    • Add context-included examples with citations.
    • Add “insufficient context” examples.
    • If you will use tool calls, add tool-call exemplars.
  5. Evaluation

    • Evaluate separately for:

      • exam MCQ accuracy,
      • open-ended reasoning,
      • multi-turn follow-ups,
      • context-faithfulness under RAG prompts.
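A minimal sketch of the split evaluation, reporting each category separately so a regression in one mode can't hide behind an aggregate score. The record schema (category/pred/gold) is an illustrative assumption:

```python
# Sketch: per-category accuracy so MCQ, multi-turn, and RAG-faithfulness
# scores are tracked independently rather than averaged together.
from collections import defaultdict

def per_category_accuracy(records: list[dict]) -> dict[str, float]:
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["pred"] == r["gold"])
    return {cat: hits[cat] / totals[cat] for cat in totals}

records = [
    {"category": "mcq", "pred": "B", "gold": "B"},
    {"category": "mcq", "pred": "C", "gold": "B"},
    {"category": "rag_faithfulness", "pred": "cited", "gold": "cited"},
]
scores = per_category_accuracy(records)
```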


Summary

  • Add 5k to 15k multi-turn examples. It is cheap insurance against conversational degradation. (ACL Anthology)
  • Do SFT first. Add DPO or ORPO only to force specific behaviors SFT will not reliably produce. (arXiv)
  • Train explicit mode control tags in most examples.
  • If using RAG, train context-faithfulness and citations explicitly. Don’t rely on prompting. (arXiv)
  • For minimal friction and long-context RAG, Qwen3-30B-A3B-Thinking-2507 is compelling. For the OSS line, gpt-oss requires strict Harmony formatting. (Hugging Face)