LLM for medical imaging

Hi all!

I have built a research platform for viewing medical DICOM images with LLM support (QVoxl.io)

The LLM takes CT slices (PNG) and text as input and outputs text.

Any suggestions which models I should try to use?

Johan


Since this is a medical field, I recommend consulting Hugging Science as well. If the input is a single image, you can simply use a good VLM. However, if the input consists of multiple images:


What you are building, in model terms

Your interface (“CT slices as PNG + prompt → text”) is effectively a multi-image vision-language problem with two extra challenges:

  1. CT is 3D, but you’re feeding 2D (hundreds of slices compressed into a small set of images).
  2. CT meaning depends on intensity handling (Hounsfield Units + windowing). If your PNG export is off, even the best model will fail.

So the best strategy is to evaluate models in tiers:

  • Tier A (general VLM baselines): easiest to integrate; best for validating your slice packaging and UX.
  • Tier B (medical/radiology VLMs): better medical language priors; often more brittle.
  • Tier C (CT/3D-native research models): closest to “study-level CT understanding,” but typically requires different preprocessing than simple PNG slices.

Tier A — Open, strong “works now” VLM baselines (start here)

These are the models I would try first because they are strong general VLMs and commonly used as baselines.

1) Qwen2.5-VL (Instruct) — strong, recent open family

  • Why for your case: good all-around vision-language performance; practical baseline to test multi-slice prompting and structured outputs.
  • The official Hugging Face collection shows Qwen2.5-VL updated through Dec 31, 2025. (Hugging Face)
  • Example model card (72B instruct): (Hugging Face)

When to use: primary baseline if you can host 7B/32B/72B variants for quality/latency comparisons.

2) Qwen3-VL — newer generation in Transformers docs

  • Why for your case: documented as a newer series with dense + MoE and “Instruct” + “Thinking” variants; useful if you want better visual reasoning while keeping open tooling. (Hugging Face)

When to use: if you want “latest-ish open family” with clean integration via Transformers.

3) Idefics3-8B-Llama3 — explicitly designed for arbitrary sequences of images

  • Why for your case: your input is multiple slices; this model explicitly supports “arbitrary sequences of image and text inputs and produces text outputs.” (Hugging Face)

When to use: as the “multi-image robustness” baseline (especially if you pass >10 images or multiple montages).

4) InternVL2.5 — strong open multimodal family

  • Why for your case: a well-known open multimodal family with multiple sizes and quantized variants; good for cross-checking if failures are “your packaging” vs “model limitation.”
  • HF collection updated Sep 28, 2025. (Hugging Face)

When to use: as a second baseline alongside Qwen/Idefics.

5) Pixtral-12B — mid-size high-quality baseline

  • Why for your case: a clean mid-size VLM option; good quality/compute tradeoff.
  • Model card notes 12B parameters + a 400M vision encoder. (Hugging Face)

When to use: if you want a strong model around the 10–15B class for interactive UI.

6) Llama 3.2 Vision Instruct — ecosystem-friendly baseline

  • Why for your case: widely supported; “text + images in / text out” model family with 11B and 90B sizes. (Hugging Face)

When to use: if you want maximum ecosystem compatibility and common deployment paths.

7) MiniCPM-V 2.6 / MiniCPM-o 2.6 — good for multi-image + low-memory experiments

  • MiniCPM-V 2.6 model card explicitly calls out multi-image support. (Hugging Face)
  • There is an int4 variant claiming lower memory usage. (Hugging Face)
  • MiniCPM-o 2.6 is presented as a strong multimodal model with evaluation claims in its card. (Hugging Face)

When to use: if you want fast iteration, quantized deployment, or want to test multi-image behavior cheaply.


Tier B — Radiology / medical VLMs (add after you have a baseline)

Medical-tuned VLMs can improve language style and some domain priors, but they also vary widely in training quality and evaluation rigor.

RadFM (radiology foundation model line)

  • The RadFM paper frames RadFM as a generalist radiology foundation effort with large-scale 2D/3D data. (Nature)
  • There is an HF repo and a GitHub repo referencing model checkpoints. (Hugging Face)

When to use: if you want radiology-oriented priors and are willing to handle research-grade setup.


Tier C — CT / 3D-native research models (closest to “study-level” CT)

If your long-term goal is “CT study understanding” rather than “slice captioning,” these papers/projects are the right background—and in some cases offer usable checkpoints.

Merlin (3D CT VLM)

  • Merlin is explicitly a “vision-language foundation model for 3D CT,” trained with CT + reports + diagnosis codes, and evaluated across many tasks. (arXiv)

When to use: as a research reference or if you want to experiment with 3D-native approaches (likely beyond pure PNG-slice chat).

CT-RATE / CT-CLIP / CT-CHAT (chest CT focused)

  • CT-RATE introduces a large chest CT dataset paired with reports and describes CT-CLIP and CT-CHAT built on it. (arXiv)
  • The CT-CLIP GitHub repo positions CT-CHAT as a 3D chest CT chat model built from CT-CLIP. (GitHub)
  • The CT-RATE dataset page contains CT-CHAT description and related assets. (Hugging Face)
  • A discussion thread mentions running CT-CHAT via provided scripts and model paths. (Hugging Face)

When to use: if your primary use case is non-contrast chest CT and you want a domain-aligned research baseline.

CT-Agent (agentic framework for CT QA)

  • CT-Agent is specifically about handling CTQA by decomposing anatomy and using a global-local token compression strategy, evaluated on CT-RATE and RadGenome-ChestCT. (arXiv)

When to use: as an architectural blueprint (tools + compression + reasoning), even if you don’t adopt it wholesale.

TotalFM (Jan 2026; organ-separated 3D CT foundation direction)

  • TotalFM proposes an organ-separated framework for 3D CT foundation modeling and compares against CT-CLIP and Merlin in zero-shot settings. (arXiv)

When to use: as the newest “where research is going” reference for efficient 3D CT VLM design.

CT2Rep / BTB3D (report generation + better 3D tokenization lines)

  • CT2Rep targets automated report generation for chest CT volumes. (GitHub)
  • BTB3D focuses on improved tokenization for 3D medical VLMs (NeurIPS 2025). (OpenReview)

When to use: if you want to push beyond Q/A into report generation with explicit 3D modeling research.


The part that matters as much as model choice: how you prepare CT slices

1) Intensity correctness (HU + windowing)

Even though you provide PNGs, your pipeline should internally treat CT as HU and then window.

  • HU conversion uses Rescale Slope and Rescale Intercept to map stored values to HU. (Stack Overflow)
  • DICOM also explicitly clarifies that CT Rescale Type is Hounsfield Units (signed), and windowing behavior matters. (DICOM)

Practical recommendation: always provide multiple windows for the same slice set (e.g., lung + soft tissue ± bone), otherwise your model is blind to key findings by construction.
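To make the HU/windowing point concrete, here is a minimal sketch of the rescale-and-window step. The window presets are common radiology conventions (not mandated by DICOM), and the function names are my own; slope/intercept come from the DICOM Rescale Slope and Rescale Intercept attributes:

```python
import numpy as np

# Typical CT window presets as (center, width) in HU.
# These are conventions, not DICOM-mandated values.
WINDOWS = {
    "lung":        (-600, 1500),
    "soft_tissue": (40, 400),
    "bone":        (300, 1500),
}

def to_hu(stored: np.ndarray, slope: float, intercept: float) -> np.ndarray:
    """Map stored pixel values to Hounsfield Units (DICOM rescale)."""
    return stored.astype(np.float32) * slope + intercept

def window_to_uint8(hu: np.ndarray, center: float, width: float) -> np.ndarray:
    """Apply a linear window and scale to 0-255 for PNG export."""
    lo, hi = center - width / 2, center + width / 2
    clipped = np.clip(hu, lo, hi)
    return ((clipped - lo) / (hi - lo) * 255).round().astype(np.uint8)
```

Exporting the same slice under each preset in `WINDOWS` gives the model the multi-window view described above; with a single window, low-contrast findings are simply not in the pixels.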

2) Do not send “all slices”

Most VLMs degrade sharply if you send too many near-duplicate slices.

Better strategies:

  • Montage-first: a 4×4 (or 5×5) montage of evenly sampled axial slices per window.
  • Then top-k singles: add a handful of high-resolution slices selected by retrieval or heuristics.
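The montage-first step can be sketched as follows, assuming the slice stack is a list of equally sized `uint8` 2D arrays (already windowed as above); the helper names are illustrative:

```python
import numpy as np

def sample_indices(n_slices: int, k: int) -> list[int]:
    """Evenly sample k slice indices from a stack of n_slices."""
    return [round(i * (n_slices - 1) / (k - 1)) for i in range(k)]

def make_montage(slices: list[np.ndarray], grid: int = 4) -> np.ndarray:
    """Tile grid*grid evenly sampled slices into one 2D montage image."""
    idx = sample_indices(len(slices), grid * grid)
    tiles = [slices[i] for i in idx]
    rows = [np.hstack(tiles[r * grid:(r + 1) * grid]) for r in range(grid)]
    return np.vstack(rows)
```

One montage per window keeps the image count low (3 windows = 3 images instead of 48 near-duplicate slices), which is exactly where most VLMs stay reliable.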

3) Slice selection should be question-aware

If the user asks about PE, hemorrhage, appendicitis, etc., your evidence packet should focus on relevant z-ranges/anatomy.

Two good ways to do that:

  • Retrieval using CT-CLIP-style embeddings (text query → relevant slices/regions). (arXiv)
  • Tool-based selection using segmentation (organ masks → choose representative slices per organ).

For segmentation, TotalSegmentator is a robust baseline tool for major anatomical structures. (GitHub)
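The retrieval option reduces to a cosine-similarity ranking once you have per-slice embeddings and a text-query embedding from a CT-CLIP-style encoder. A minimal sketch (the embeddings here are assumed to be precomputed; the function name is my own):

```python
import numpy as np

def top_k_slices(slice_emb: np.ndarray, query_emb: np.ndarray, k: int = 5) -> list[int]:
    """Rank slices by cosine similarity to a text query embedding.

    slice_emb: (n_slices, dim) array of per-slice embeddings.
    query_emb: (dim,) embedding of the user's question.
    """
    s = slice_emb / np.linalg.norm(slice_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = s @ q
    return np.argsort(-sims)[:k].tolist()
```

The returned indices are the z-positions to pull as high-resolution "top-k singles" for the evidence packet.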


What I would actually try first (a concrete shortlist)

If you want the best chance of success quickly (open models)

  1. Qwen2.5-VL Instruct (start with 7B; compare up to 32B/72B if possible) (Hugging Face)
  2. Idefics3-8B-Llama3 (multi-image stability benchmark) (Hugging Face)
  3. InternVL2.5 (cross-check baseline) (Hugging Face)
  4. Pixtral-12B-2409 (mid-size quality/latency comparison) (Hugging Face)
  5. MiniCPM-V 2.6 (or int4) if you need faster iteration / lower VRAM (Hugging Face)

If your focus is chest CT and you want CT-native references

  • CT-CLIP / CT-CHAT (domain-aligned 3D chest CT line) (arXiv)
  • Merlin (3D CT VLM foundation reference) (arXiv)
  • CT-Agent (agentic CT QA blueprint) (arXiv)
  • TotalFM (2026) (organ-separated 3D CT foundation direction) (arXiv)

Prompting patterns that work better for CT slice chat

Pattern A: Evidence-cited answering (reduces hallucinations)

Require the model to cite which slice tiles it used.

Example (conceptual):

  • Input: montage images with tile IDs (A1…D4), plus “Question: …”

  • Output schema:

    • Answer
    • Evidence used: [tile IDs]
    • Uncertainty / what’s missing
    • Next suggested views/windows (not clinical advice; just what images would clarify)

This simple structure tends to reduce “confident guessing” because it forces the model to ground its answer in specific images.
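One lightweight way to enforce this schema, assuming you ask the model for JSON output (the instruction string and helper names below are illustrative, not a fixed API):

```python
import json

# Tile IDs for a 4x4 montage, matching the A1...D4 labeling above.
TILE_IDS = [f"{r}{c}" for r in "ABCD" for c in "1234"]

SCHEMA_INSTRUCTION = (
    "Answer in JSON with keys: answer, evidence_tiles (list of tile IDs), "
    "uncertainty, next_views."
)

def validate_evidence(reply_json: str) -> dict:
    """Parse the model reply and reject citations of nonexistent tiles."""
    reply = json.loads(reply_json)
    bad = [t for t in reply.get("evidence_tiles", []) if t not in TILE_IDS]
    if bad:
        raise ValueError(f"unknown tile IDs cited: {bad}")
    return reply
```

Rejecting (and re-prompting on) invalid tile citations catches a surprising share of hallucinated evidence before it reaches the user.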

Pattern B: Tool-augmented explanations

If you run segmentation or measurements, put them into the prompt as structured text:

  • organ volumes
  • detected candidate regions
  • HU statistics in ROI (if you compute them)

Then ask the model to explain the tool output rather than infer everything from pixels.
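A sketch of what "structured text in the prompt" can look like; the field names and formatting are assumptions for illustration:

```python
def tool_context(organ_volumes: dict, roi_hu: dict) -> str:
    """Render tool outputs as structured prompt text for the model."""
    lines = ["## Tool measurements (explain these; do not re-estimate from pixels)"]
    for organ, ml in organ_volumes.items():
        lines.append(f"- {organ} volume: {ml:.0f} mL")
    for roi, stats in roi_hu.items():
        lines.append(f"- ROI '{roi}': mean {stats['mean']:.0f} HU, max {stats['max']:.0f} HU")
    return "\n".join(lines)
```

Prepending this block to the question shifts the model's job from perception (weak) to explanation of measurements (much stronger).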

This matches the direction of CT-Agent-style pipelines (tools + reasoning). (arXiv)


How to compare models fairly on your platform

Once you have the evidence packet generator, model comparison becomes much easier.

Use a repeatable evaluation harness

  • OpenCompass maintains VLMEvalKit, an evaluation toolkit for many multi-modality models and benchmarks. (arXiv)
  • They also host an Open VLM leaderboard space (useful for triangulating baseline model strength). (Hugging Face)

Practical recommendation: create a small internal benchmark:

  • 30–100 anonymized research cases (or public CT-RATE-derived cases for chest),
  • 5–10 question types,
  • fixed evidence packet templates,
  • score: evidence consistency, omission rate, contradiction rate, and clinician spot-checks.
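The per-case scoring can be a very small function; this sketch assumes you have gold evidence tiles per question and a clinician-flagged contradiction count (metric definitions here are one reasonable choice, not a standard):

```python
def score_case(pred_evidence: set, gold_evidence: set,
               contradictions: int, n_claims: int) -> dict:
    """Per-case metrics: evidence consistency, omission rate, contradiction rate."""
    hit = len(pred_evidence & gold_evidence)
    return {
        # fraction of cited evidence that is actually relevant
        "evidence_consistency": hit / len(pred_evidence) if pred_evidence else 0.0,
        # fraction of relevant evidence the model failed to cite
        "omission_rate": 1 - hit / len(gold_evidence) if gold_evidence else 0.0,
        # fraction of claims a reviewer flagged as contradicting the images
        "contradiction_rate": contradictions / n_claims if n_claims else 0.0,
    }
```

Averaging these across the 30–100 cases gives a stable per-model score you can track as you swap VLMs or change the packet templates.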

Summary recommendation

  1. Start with general VLM baselines (Qwen2.5-VL + Idefics3 + InternVL2.5) to validate your slice packaging, windowing, and prompting. (Hugging Face)
  2. Make your CT evidence packets strong (multi-window, montage + top-k singles, question-aware slice selection). HU/windowing correctness is foundational. (Stack Overflow)
  3. Add CT-native research models as “north stars” (CT-CLIP/CT-CHAT for chest CT; Merlin/TotalFM/CT-Agent as study-level references). (arXiv)
  4. Track model performance with a consistent harness (VLMEvalKit + your own CT-specific tests). (arXiv)