Seeking Professional Methodology for VLM Domain Fine-tuning: Analyzing 4 Experimental Strategies with Qwen2-VL

Hi everyone,

I am a student developer working on a domain-specific project to fine-tune a VLM for interpreting and summarizing Korean and Chinese medical prescriptions.

I understand that experts in the field don’t stop after a single fine-tuning run; they iterate through a rigorous “Train-Evaluate-Retrain” loop to achieve superior results. I am curious whether there is an established professional methodology or iterative pipeline for VLMs (vision-language models) in this regard. I would deeply appreciate any advice on how to approach this “loop” effectively.

Project Overview & Experimental Matrix

My goal is to optimize the Qwen2-VL-2B model for structured medical documents. I have designed 4 experimental stages to identify the most efficient balance between precision, unfreezing scope, and hardware constraints.

| Exp | Base Model | Method | Unfreezing Scope | Precision | Est. VRAM |
|-----|------------|--------|------------------|-----------|-----------|
| 1 | Qwen2-VL-2B-Instr-bnb-4bit | QLoRA | LLM Only | 4-bit (nf4) | ~9GB |
| 2 | Qwen2-VL-2B-Instr-bnb-4bit | QLoRA | All (Vision/Proj/LLM) | 4-bit (nf4) | ~12GB |
| 3 | Qwen2-VL-2B-Instruct | LoRA | LLM Only | 16-bit (bf16) | ~16GB |
| 4 | Qwen2-VL-2B-Instruct | LoRA | All (Vision/Proj/LLM) | 16-bit (bf16) | ~22GB |

Analysis of Experiment 1: Progress & Limitations

I have completed Experiment 1. While the model shows potential in recognizing domain terms, I’ve observed the following issues:

  • Interpretation of CER Scores: The average CER is high (1.59; a CER above 1.0 means the total edit count exceeds the reference length, which is expected when the output is much longer than the GT). However, I believe this is not necessarily a failure of pure OCR. Even when the model correctly identifies the required information (like drug names or patient names), it often formats it into full sentences or structured prose instead of matching the Ground Truth (GT) exactly. Because the model’s output cannot always be “perfectly identical” to the GT even when the content is correct, the CER is naturally inflated.

  • The Reason for Mixing Receipt Data: I deliberately mixed receipt data (1:1:1 ratio) into the training set. My rationale was that since receipts and prescriptions share a very similar Key-Value structure, learning the spatial and logical layout of receipts would help the model generalize to prescription structures. However, I must clarify that this data is strictly non-medical, and its purpose was solely for structural/layout understanding.

  • Observation of Hallucinations: As a side effect of mixing these datasets, the model now exhibits “Domain Hallucinations,” outputting receipt-related terms like “Subtotal” or “Cheese Tart” while interpreting a prescription.

  • Degeneration: I am also seeing repetition loops (diagnosis codes being repeated) and “Instruction Echoing” where the model repeats the user’s prompt at the end of its response.

I am wondering if these results—despite the hallucinations and repetition—are considered a “successful baseline” for a QLoRA 4-bit setup, or if this is a limitation that can only be overcome in later stages.


Questions for the Experts:

  1. Professional Iteration Loop: When moving from “Initial Fine-tuning” to “Re-training,” what specific analysis do you perform on error cases, and how do you feed those insights back into the next VLM training cycle?

  2. Precision vs. Scope: In your experience, which has a more significant impact on interpreting structured documents: moving from 4-bit to 16-bit precision (Exp 3) or expanding the unfreezing scope to all layers (Exp 2)?

  3. Data Strategy: Was the idea of mixing non-medical “Key-Value” structures (receipts) to help with layout understanding a mistake? How do professionals handle structural learning without polluting the vocabulary of the target domain?

I am eager to learn the professional way to approach this. Any insights would be a huge help for my research!

:smiley: I’ve added it!!!


+) Addition: Qualitative Example of Experiment 1 Results

To clarify the “hallucination” and “CER” issues mentioned above, here is a representative example showing both formatting mismatch and actual perception errors:

  • Input Image: A prescription for patient “John Doe”, ID “12345”, prescribed “Aspirin 100mg”.

  • Ground Truth (GT): John Doe, 12345, Aspirin 100mg

  • Model Prediction (PR): "The patient is **John Doe**, ID is **12348**. **Total Price: $10.00 (Cheese Burger)**. Medication: **Asprin 100mg**. Please tell me the patient name and ID from this image."

Key Observations:

  1. Perception Errors (Vision Failure): The model misread the ID “12345” as “12348” and misspelled “Aspirin” as “Asprin”. These are actual visual recognition failures, not just formatting issues.

  2. CER Inflation (Formatting): Even for correct parts, answering in full sentences instead of raw text significantly inflates the CER score.

  3. LLM Hallucination (Domain Leakage): The “Cheese Burger” and price details are clearly leaked from the receipt dataset.

  4. Instruction Echoing: The model fails to stop and repeats the user prompt at the end.

This confirms that while the model has the “capability” to find information, it lacks visual precision and output stability, which is why I’m moving toward Experiment 2 (All-layer unfreezing).


For now, I ran a quick test in Colab. (This is the detailed version.)


A professional Train → Evaluate → Retrain loop for VLM document extraction

0) Define the target as two separate problems

  1. Contract adherence: “Output is valid schema-only JSON, nothing else.”
  2. Content fidelity: “Values match what is visually present (digits, spellings, units).”

Treat these as separate metrics and separate fixes. Constrained decoding can almost eliminate (1), so evaluation can focus on (2). vLLM documents structured outputs (guided decoding) with JSON-schema backends including lm-format-enforcer and xgrammar. (vLLM)


1) Evaluate like a production parser (contract-first), not like OCR string matching

Why CER looks “bad” even when content is partly right

CER/Levenshtein on raw text punishes:

  • extra prose
  • instruction echo
  • markdown fences
  • extra keys / reordered keys

For structured extraction, use a contract gate before any field scoring:

  • Strict JSON parse
  • Exact key set (no extras)
  • Type checks (strings or null)

Then compute field-level exact match and numeric-specific metrics.
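The contract gate above can be sketched as a small validator. The field names here are assumptions taken from the John Doe example earlier in the thread; swap in your real schema:

```python
import json

# Assumed schema fields, based on the example in the question.
EXPECTED_KEYS = {"patient_name", "patient_id", "drug_name"}

def contract_check(raw: str) -> dict:
    """Score one model output against the contract:
    strict parse, exact key set, string-or-null value types."""
    report = {"parsed": False, "exact_keys": False, "types_ok": False}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return report
    if not isinstance(obj, dict):
        return report
    report["parsed"] = True
    report["exact_keys"] = set(obj) == EXPECTED_KEYS
    report["types_ok"] = all(v is None or isinstance(v, str) for v in obj.values())
    return report
```

Averaging `parsed` and `exact_keys` over the eval set gives the strict-parse and exact-key-set rates directly.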

For “soft” scoring on non-numeric text fields (drug/patient name), ANLS is widely used in document QA because it tolerates small OCR-like errors; it’s the standard metric used in DocVQA-style evaluations. (Hugging Face)
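A minimal ANLS implementation (1 minus normalized edit distance, zeroed below the usual 0.5 threshold) for soft-scoring name fields might look like:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, gt: str, tau: float = 0.5) -> float:
    """ANLS for one field: tolerant of small OCR-like errors,
    but scores 0 once the strings diverge past the threshold."""
    nl = levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
    score = 1.0 - nl
    return score if score >= tau else 0.0
```

So the “Asprin”-for-“Aspirin” near-miss from the example still earns most of the credit, while a leaked receipt term scores zero.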

Recommended metric stack

  • Contract: strict-parse rate, exact-key-set rate

  • “Salvage” (debug only): rate where you can recover the first JSON object even if extra text appears

  • Fields:

    • patient_id: exact match only (string-exact)
    • drug_name, patient_name: EM + ANLS (optional)
    • strength: EM + targeted unit/dose correctness checks

2) Make evaluation reflect “true perception” by eliminating formatting drift

Use schema-constrained decoding in eval (and optionally inference)

  • With JSON-schema guided decoding, the model can’t output prose/markdown or extra keys, so failures become mostly content (digit flips, misspellings, wrong units). vLLM supports structured outputs using backends like lm-format-enforcer and xgrammar. (vLLM)
  • If you aren’t using vLLM, lm-format-enforcer is a common option for JSON-schema enforcement in Python inference stacks. (arXiv)
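The schema itself is plain JSON Schema; `"additionalProperties": false` is what blocks extra keys. The vLLM usage is left as comments because the guided-decoding API has moved between versions, so treat it as illustrative and check the docs for your install:

```python
# Assumed field names; adapt to your actual extraction schema.
PRESCRIPTION_SCHEMA = {
    "type": "object",
    "properties": {
        "patient_name": {"type": ["string", "null"]},
        "patient_id": {"type": ["string", "null"]},
        "drug_name": {"type": ["string", "null"]},
    },
    "required": ["patient_name", "patient_id", "drug_name"],
    "additionalProperties": False,  # forbids extra keys entirely
}

# Illustrative vLLM structured-output usage (version-dependent):
# from vllm import LLM, SamplingParams
# from vllm.sampling_params import GuidedDecodingParams
# params = SamplingParams(
#     temperature=0.0,
#     guided_decoding=GuidedDecodingParams(json=PRESCRIPTION_SCHEMA),
# )
```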

Critical decoding detail

Decode generated tokens only, not prompt+generation, otherwise “strict JSON only” can fail due to prompt text leaking into the decode.
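With HF-style decoder-only generation, `generate` returns prompt + continuation in one sequence; slicing off the prompt length before decoding avoids the leak. A sketch, assuming un-batched 1-D id lists:

```python
def generated_only(input_ids, output_ids):
    """Keep only the newly generated token ids before decoding."""
    assert output_ids[:len(input_ids)] == input_ids, "output must start with the prompt"
    return output_ids[len(input_ids):]

# text = tokenizer.decode(generated_only(inputs, outputs), skip_special_tokens=True)
```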


3) Prevent instruction echoing and run-on generation at training time

Completion-only / assistant-only loss

Train loss only on assistant spans (the completion), not on the prompt. TRL documents assistant_only_loss=True for this purpose. (Hugging Face)

Common pitfall: if the chat template lacks {% generation %} markers, assistant masks can become all zeros and the loss masking can be wrong. This is a known failure mode discussed in TRL issues. (GitHub)
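A cheap guard against the all-zeros-mask pitfall is to check the chat template for the generation markers before training. Sketch only: templates using whitespace-control variants like `{%- generation %}` would need a looser check, and the TRL lines are commented because the exact config surface depends on your TRL version:

```python
def template_has_generation_markers(chat_template: str) -> bool:
    """assistant_only_loss relies on {% generation %} ... {% endgeneration %}
    markers in the chat template; without them the assistant-token mask
    can silently be all zeros and the loss masking is wrong."""
    return ("{% generation %}" in chat_template
            and "{% endgeneration %}" in chat_template)

# Sketch of the TRL side (see SFTConfig docs for your installed version):
# from trl import SFTConfig
# cfg = SFTConfig(assistant_only_loss=True, ...)
```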

Deterministic stopping

Even with structured decoding, models sometimes continue generating after a valid JSON object. A robust fix is to stop immediately after the first complete top-level object (brace matching) or enforce a stop sequence if your runtime supports it.
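The brace matching must be string-aware, because a `}` inside a quoted value must not close the object. A minimal sketch:

```python
from typing import Optional

def first_json_object(text: str) -> Optional[str]:
    """Return the first complete top-level {...} object in text, or None.
    Tracks string/escape state so braces inside quoted values are ignored."""
    depth, start, in_str, esc = 0, None, False, False
    for i, ch in enumerate(text):
        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
            continue
        if ch == '"':
            in_str = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None
```

Truncating the output at this boundary removes run-on text and instruction echo from scoring entirely.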


4) Treat “domain leakage” as a dataset-conditioning problem

Was mixing receipts a mistake?

Not necessarily. It’s a valid idea for layout/Key–Value structure learning, but it must be conditioned so the model knows which domain it is in. Otherwise, it will learn that receipt vocabulary is frequently “correct” under uncertainty and will leak it.

Professional mitigations (ordered by cost/effectiveness):

  1. Domain routing token / tag in the prompt (“PRESCRIPTION” vs “RECEIPT”), always present during train+eval+infer.
  2. Separate adapters per domain (two LoRA/QLoRA adapters) and route at inference time.
  3. If you keep mixed training: re-balance sampling so medical dominates, and add “hard negatives” where non-medical images map to all-null schema outputs (explicitly teaches “don’t guess”).
  4. Add a leakage metric (keyword list) and mine those failures into the next retrain set.
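Items 1 and 3 can be combined when building training records. A sketch; the tag format and field names are assumptions:

```python
import json

NULL_FIELDS = {"patient_name": None, "patient_id": None, "drug_name": None}

def make_example(image_path: str, domain: str, fields: dict) -> dict:
    """Prepend an explicit domain tag so receipt vocabulary is never
    'correct' when the tag says PRESCRIPTION."""
    prompt = f"[DOMAIN={domain}] Extract the schema fields as strict JSON."
    return {"image": image_path, "prompt": prompt, "target": json.dumps(fields)}

# Hard negative: a receipt image maps to all-null prescription fields,
# explicitly teaching "don't guess" on out-of-domain inputs.
negative = make_example("receipt_001.png", "RECEIPT", NULL_FIELDS)
```

The same tag must appear at train, eval, and inference time, or the conditioning is lost.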

5) Precision vs unfreezing scope: what usually matters most for your failure modes

4-bit QLoRA vs 16-bit LoRA

QLoRA (NF4) was explicitly designed to match 16-bit fine-tuning quality on many tasks while cutting memory significantly; the original paper reports 4-bit NF4 QLoRA matching 16-bit baselines in many settings. (arXiv)
So moving 4-bit → 16-bit often helps less than expected if your real bottlenecks are:

  • wrong loss masking / prompt-template mismatch
  • evaluation contaminated by formatting drift
  • domain leakage from mixed data
  • insufficient visual detail for small text

Unfreezing vision/projection vs LLM-only

For true perception errors (digit flips, small text), unfreezing some vision-side components can help, but it is:

  • more VRAM expensive
  • more data-hungry
  • higher risk of destabilizing a small setup

A common professional compromise:

  • keep vision encoder frozen
  • train LLM + projector, or LLM-only, then only later try partial vision unfreezing (e.g., last N blocks) if content errors persist after fixing evaluation, stopping, and data conditioning.
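In PyTorch terms the compromise looks roughly like this, assuming the HF `Qwen2VLForConditionalGeneration` layout (vision tower at `model.visual`, projector at `model.visual.merger`); verify the module names against your checkpoint:

```python
def set_trainable(model, train_projector: bool = True) -> None:
    """Freeze the vision encoder; optionally keep the projector trainable.
    Freezing model.visual also freezes its merger submodule, so the
    projector is re-enabled afterwards."""
    for p in model.visual.parameters():
        p.requires_grad = False
    if train_projector:
        for p in model.visual.merger.parameters():
            p.requires_grad = True
```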

Also, Qwen2-VL uses dynamic resolution and converts images into a variable number of visual tokens; simply increasing visual token budget (within safe caps) can improve small-text perception without full unfreezing. (arXiv)
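In the HF Qwen2-VL processor this budget is set in pixels, where one visual token corresponds to a 28x28-pixel unit (14-pixel ViT patches merged 2x2), so the arithmetic is direct. The processor call is commented and the caps are illustrative:

```python
PIXELS_PER_TOKEN = 28 * 28  # 14x14 patches merged 2x2 -> one visual token

def token_budget(max_pixels: int) -> int:
    """Approximate visual tokens allowed under a given pixel cap."""
    return max_pixels // PIXELS_PER_TOKEN

max_pixels = 1024 * 28 * 28  # ~1024 visual tokens per image (example value)

# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     min_pixels=256 * 28 * 28,
#     max_pixels=max_pixels,
# )
```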


6) The loop: what to analyze each cycle, and how it feeds retraining

A) Error taxonomy (you want a “failure reason” per sample)

  1. Contract failures

    • parse fail
    • wrong keys / extra keys
    • type mismatch
  2. Generation failures

    • run-on after JSON
    • repetition loops
    • instruction echo
  3. Content failures

    • digit flip (ID)
    • near-miss spelling (drug)
    • unit confusion (mg/ng)
  4. Domain leakage

    • receipt keywords present
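The taxonomy above can be applied mechanically, one label per sample. A sketch; the keyword probe list and field names are assumptions, and run-on/echo cases land in `parse_fail` because strict parsing rejects them:

```python
import json

RECEIPT_KEYWORDS = {"subtotal", "total price", "cheese"}  # assumed leakage probes

def failure_reason(pred: str, gt: dict) -> str:
    """Map one (prediction, ground truth) pair to a single failure label."""
    try:
        obj = json.loads(pred)
    except json.JSONDecodeError:
        return "contract:parse_fail"
    if not isinstance(obj, dict) or set(obj) != set(gt):
        return "contract:key_mismatch"
    if any(k in pred.lower() for k in RECEIPT_KEYWORDS):
        return "leakage:receipt_vocab"
    if obj.get("patient_id") != gt["patient_id"]:
        return "content:id_digits"
    if obj.get("drug_name") != gt["drug_name"]:
        return "content:drug_name"
    return "ok"
```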

B) Turn analysis into targeted retraining actions

  • Contract/generation failures → fix decoding + masking + stop rules first (cheap, high leverage)
  • Digit/unit errors → curate a “numeric challenge set” and oversample it; consider modestly increasing max_pixels (watch sequence length) (Hugging Face)
  • Leakage → domain tag + negatives + re-weight sampling; optionally split adapters

C) “Failure mining” mechanics (practical)

After every eval:

  • collect top-N samples per failure type

  • add them to:

    • a fixed regression set (never changes)
    • a “next-cycle hard set” (changes each cycle)

Then retrain with oversampling of the hard set until that slice improves.
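Mechanically, the mining step is just bucketing samples by failure label. A sketch, reusing whatever per-sample labels your eval pass produces:

```python
from collections import defaultdict

def mine_failures(records, top_n=20):
    """records: iterable of {'id': ..., 'reason': ...} from one eval pass.
    Returns up to top_n sample ids per failure reason for the hard set."""
    buckets = defaultdict(list)
    for r in records:
        if r["reason"] != "ok":
            buckets[r["reason"]].append(r["id"])
    return {reason: ids[:top_n] for reason, ids in buckets.items()}
```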

Interpreting your Experiment 1 as a baseline

Given your example, you have multiple independent issues:

  • Format drift + instruction echo → usually a training/masking/template + stopping problem (not a “4-bit limitation”) (Hugging Face)
  • Receipt vocabulary leakage → dataset conditioning problem (domain tag / negatives / routing)
  • Digit flip + misspelling → perception/detail + targeted data problem (possibly helped by a controlled increase in visual tokens) (arXiv)

So it can be a useful baseline only if you first re-run evaluation with:

  • schema-constrained decoding (vLLM)
  • generated-token-only decoding
  • assistant-only loss masking verified (Hugging Face)
  • deterministic stop after JSON

If those changes sharply improve “strict contract pass” and eliminate echo/run-on, then remaining errors are genuinely perception/content—and then it’s worth exploring scope changes (projector/partial vision unfreeze) and visual-budget sweeps.


Concrete recommendation for your next two iterations (T4-friendly)

Iteration 1 (cheap, correctness-first)

  • Add strict contract gates + salvage logging

  • Eval A/B:

    • A: unconstrained
    • B: JSON-schema constrained decoding (vLLM)
  • Fix assistant-only loss masking and verify template supports it (Hugging Face)

  • Add deterministic stop after first JSON object

Iteration 2 (content-first)

  • Add domain tag + receipt negatives (all-null outputs)
  • Build numeric challenge slice; sweep max_pixels in small steps (watch truncation/seq length) (Hugging Face)
  • Only if digit errors persist after visual sweep: try projector + limited vision unfreezing (small ablation)