Hi everyone,
I am a student developer working on a domain-specific project to fine-tune a VLM for interpreting and summarizing Korean and Chinese medical prescriptions.
I understand that experts in the field don’t just stop after a single fine-tuning session; they iterate through a rigorous “Train-Evaluate-Retrain” loop to achieve superior results. I am curious whether there is a specific professional methodology or an established iterative pipeline for VLMs (Vision-Language Models) in this regard. I would deeply appreciate any advice on how to approach this “loop” effectively.
Project Overview & Experimental Matrix
My goal is to optimize the Qwen2-VL-2B model for structured medical documents. I have designed 4 experimental stages to identify the most efficient balance between precision, unfreezing scope, and hardware constraints.
| Exp | Base Model | Method | Unfreezing Scope | Precision | Est. VRAM |
|---|---|---|---|---|---|
| 1 | Qwen2-VL-2B-Instr-bnb-4bit | QLoRA | LLM Only | 4-bit (nf4) | ~9GB |
| 2 | Qwen2-VL-2B-Instr-bnb-4bit | QLoRA | All (Vision/Proj/LLM) | 4-bit (nf4) | ~12GB |
| 3 | Qwen2-VL-2B-Instruct | LoRA | LLM Only | 16-bit (bf16) | ~16GB |
| 4 | Qwen2-VL-2B-Instruct | LoRA | All (Vision/Proj/LLM) | 16-bit (bf16) | ~22GB |
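As a sketch of how the “Unfreezing Scope” column could be expressed with Hugging Face `peft`, here is what Exp 1 (LLM only) vs. Exp 2 (all layers) might look like. The `target_modules` names are assumptions based on common Qwen2-style layer naming, not verified against the actual checkpoint; inspect `model.named_modules()` before using them.

```python
# Sketch only: module names below are guesses for Qwen2-VL and must be
# checked against model.named_modules() on the real checkpoint.
from peft import LoraConfig

# Exp 1: adapt the language model's attention projections only.
llm_only = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Exp 2: additionally adapt MLP layers plus vision tower / projector
# (the vision-side names here are hypothetical placeholders).
all_layers = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "qkv", "proj", "fc1", "fc2"],
)
```

The point of writing both configs up front is that Exp 1 → Exp 2 then differs only in `target_modules`, which keeps the comparison in the table clean.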
Analysis of Experiment 1: Progress & Limitations
I have completed Experiment 1. While the model shows potential in recognizing domain terms, I’ve observed the following issues:
- **Interpretation of CER scores:** The average CER is high (1.59). However, I believe this is not necessarily a failure of pure OCR. Even when the model correctly identifies the required information (drug names, patient names), it often wraps it in full sentences or structured prose instead of matching the Ground Truth (GT) exactly. Since a CER above 1.0 means the edit distance exceeds the GT’s length, much of the score reflects this extra verbosity rather than misreading.
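To separate formatting inflation from real misreads, one option is to score both the raw prediction and a field-normalized version against the GT. A minimal sketch with a plain edit distance (the example strings are illustrative, not actual model outputs):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(gt: str, pred: str) -> float:
    """Character error rate; exceeds 1.0 when pred is much longer than gt."""
    return levenshtein(gt, pred) / max(len(gt), 1)

gt = "John Doe, 12345, Aspirin 100mg"
verbose = "The patient is John Doe, ID is 12345. Medication: Aspirin 100mg."
fields_only = "John Doe, 12345, Aspirin 100mg"  # after extracting key fields

print(cer(gt, verbose))      # inflated despite fully correct content
print(cer(gt, fields_only))  # 0.0 once formatting is normalized away
```

Comparing the two scores per sample shows how much of the 1.59 is verbosity versus genuine perception error.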
- **Why I mixed receipt data:** I deliberately mixed receipt data into the training set at a 1:1:1 ratio. My rationale was that since receipts and prescriptions share a very similar key-value structure, learning the spatial and logical layout of receipts would help the model generalize to prescription structures. To be clear, this data is strictly non-medical; its purpose was solely structural/layout understanding.
- **Hallucinations:** As a side effect of mixing these datasets, the model now exhibits “domain hallucinations,” outputting receipt-related terms like “Subtotal” or “Cheese Tart” while interpreting a prescription.
- **Degeneration:** I am also seeing repetition loops (diagnosis codes repeated over and over) and “instruction echoing,” where the model repeats the user’s prompt at the end of its response.
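For the repetition loops and echoing, it may be worth ruling out decoding settings before blaming the adapter. A hedged sketch using Hugging Face `transformers` `generate()` parameter names (the values are untuned starting points, not recommendations):

```python
# Decoding-side mitigations for repetition and echoing (assumed values).
gen_kwargs = dict(
    max_new_tokens=256,        # hard cap so loops cannot run unbounded
    do_sample=False,           # deterministic decoding for evaluation runs
    repetition_penalty=1.15,   # penalizes re-emitting the same diagnosis codes
    no_repeat_ngram_size=6,    # blocks verbatim 6-gram repeats
)

# Usage, with model/processor from your Qwen2-VL setup:
# out = model.generate(**inputs, **gen_kwargs,
#                      eos_token_id=processor.tokenizer.eos_token_id)
```

If echoing persists with these settings, a common culprit is the training data itself: if targets were not terminated with the chat template’s EOS token, the model never learns to stop.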
I am wondering if these results—despite the hallucinations and repetition—are considered a “successful baseline” for a QLoRA 4-bit setup, or if this is a limitation that can only be overcome in later stages.
Questions for the Experts:
- **Professional iteration loop:** When moving from initial fine-tuning to retraining, what specific analysis do you perform on error cases, and how do you feed those insights back into the next VLM training cycle?
- **Precision vs. scope:** In your experience, which has the greater impact on interpreting structured documents: moving from 4-bit to 16-bit precision (Exp 3) or expanding the unfreezing scope to all layers (Exp 2)?
- **Data strategy:** Was mixing non-medical key-value structures (receipts) to aid layout understanding a mistake? How do professionals handle structural learning without polluting the target domain’s vocabulary?
I am eager to learn the professional way to approach this. Any insights would be a huge help for my research!
+) Update: Qualitative Example of Experiment 1 Results
To clarify the “hallucination” and “CER” issues mentioned above, here is a representative example showing both formatting mismatch and actual perception errors:
- **Input Image:** A prescription for patient “John Doe”, ID “12345”, prescribed “Aspirin 100mg”.
- **Ground Truth (GT):** `John Doe, 12345, Aspirin 100mg`
- **Model Prediction (PR):** "The patient is **John Doe**, ID is **12348**. **Total Price: $10.00 (Cheese Burger)**. Medication: **Asprin 100mg**. Please tell me the patient name and ID from this image."
Key Observations:
- **Perception errors (vision failure):** The model misread the ID “12345” as “12348” and misspelled “Aspirin” as “Asprin”. These are genuine visual recognition failures, not formatting issues.
- **CER inflation (formatting):** Even for the correct parts, answering in full sentences instead of raw text significantly inflates the CER score.
- **LLM hallucination (domain leakage):** The “Cheese Burger” and price details are clearly leaked from the receipt dataset.
- **Instruction echoing:** The model fails to stop and repeats the user prompt at the end.
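These four observations can feed the Train-Evaluate-Retrain loop mechanically. A minimal sketch that tags each prediction with error categories, so the next training cycle can be targeted at whichever bucket dominates (the receipt-term blocklist and the verbosity threshold are assumptions, not tuned values):

```python
# Hypothetical error-triage helper for the "Evaluate" step of the loop.
# RECEIPT_TERMS is an assumed blocklist built from the receipt dataset's vocab.
RECEIPT_TERMS = {"subtotal", "cheese", "total price"}

def tag_errors(gt_fields, prediction, prompt):
    """Return (error tags, GT fields missing from the prediction)."""
    pred = prediction.lower()
    tags = set()
    if any(term in pred for term in RECEIPT_TERMS):
        tags.add("domain_hallucination")   # receipt vocabulary leaked in
    if prompt.lower() in pred:
        tags.add("instruction_echo")       # prompt repeated verbatim
    missing = [f for f in gt_fields if f.lower() not in pred]
    if missing:
        tags.add("perception_error")       # content genuinely misread
    elif len(prediction) > 2 * sum(len(f) for f in gt_fields):
        tags.add("formatting_mismatch")    # right content, verbose form
    return tags, missing
```

Bucketing a held-out set this way indicates where to intervene next: hallucinations point at the data mix, echoing at decoding or EOS handling, and perception errors at the vision side (i.e., Experiment 2’s all-layer unfreezing).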
This confirms that while the model has the “capability” to find information, it lacks visual precision and output stability, which is why I’m moving toward Experiment 2 (All-layer unfreezing).