Since this is a medical application, I recommend consulting Hugging Science as well. If the input is a single image, you can simply use a good VLM. If the input consists of multiple images, however, the problem changes:
What you are building, in model terms
Your interface (“CT slices as PNG + prompt → text”) is effectively a multi-image vision-language problem with two extra challenges:
- CT is 3D, but you’re feeding 2D (hundreds of slices compressed into a small set of images).
- CT meaning depends on intensity handling (Hounsfield Units + windowing). If your PNG export is off, even the best model will fail.
So the best strategy is to evaluate models in tiers:
- Tier A (general VLM baselines): easiest to integrate; best for validating your slice packaging and UX.
- Tier B (medical/radiology VLMs): better medical language priors; often more brittle.
- Tier C (CT/3D-native research models): closest to “study-level CT understanding,” but typically requires different preprocessing than simple PNG slices.
Tier A — Open, strong “works now” VLM baselines (start here)
These are the models I would try first because they are strong general VLMs and commonly used as baselines.
1) Qwen2.5-VL (Instruct) — strong, recent open family
- Why for your case: good all-around vision-language performance; practical baseline to test multi-slice prompting and structured outputs.
- The official Hugging Face collection shows Qwen2.5-VL updated through Dec 31, 2025. (Hugging Face)
- Example model card (72B instruct): (Hugging Face)
When to use: primary baseline if you can host 7B/32B/72B variants for quality/latency comparisons.
2) Qwen3-VL — newer generation in Transformers docs
- Why for your case: documented as a newer series with dense + MoE and “Instruct” + “Thinking” variants; useful if you want better visual reasoning while keeping open tooling. (Hugging Face)
When to use: if you want “latest-ish open family” with clean integration via Transformers.
3) Idefics3-8B-Llama3 — explicitly designed for arbitrary sequences of images
- Why for your case: your input is multiple slices; this model explicitly supports “arbitrary sequences of image and text inputs and produces text outputs.” (Hugging Face)
When to use: as the “multi-image robustness” baseline (especially if you pass >10 images or multiple montages).
4) InternVL2.5 — strong open multimodal family
- Why for your case: a well-known open multimodal family with multiple sizes and quantized variants; good for cross-checking if failures are “your packaging” vs “model limitation.”
- HF collection updated Sep 28, 2025. (Hugging Face)
When to use: as a second baseline alongside Qwen/Idefics.
5) Pixtral-12B — mid-size high-quality baseline
- Why for your case: a clean mid-size VLM option; good quality/compute tradeoff.
- Model card notes 12B parameters + a 400M vision encoder. (Hugging Face)
When to use: if you want a strong model around the 10–15B class for interactive UI.
6) Llama 3.2 Vision Instruct — ecosystem-friendly baseline
- Why for your case: widely supported; “text + images in / text out” model family with 11B and 90B sizes. (Hugging Face)
When to use: if you want maximum ecosystem compatibility and common deployment paths.
7) MiniCPM-V 2.6 / MiniCPM-o 2.6 — good for multi-image + low-memory experiments
- MiniCPM-V 2.6 model card explicitly calls out multi-image support. (Hugging Face)
- There is an int4 variant claiming lower memory usage. (Hugging Face)
- MiniCPM-o 2.6 is presented as a strong multimodal model with evaluation claims in its card. (Hugging Face)
When to use: if you want fast iteration, quantized deployment, or want to test multi-image behavior cheaply.
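All of the Tier A families above accept interleaved image/text chat messages. A minimal sketch of how to package multiple slice PNGs into one user turn, following the Qwen2.5-VL-style chat-template convention (the exact content schema varies by model family, and the file paths and question here are placeholders):

```python
# Sketch: packaging multiple CT slice PNGs into one multi-image chat turn.
# Follows the Qwen2.5-VL chat-template convention of a "content" list with
# interleaved {"type": "image"} and {"type": "text"} items; other families
# (Idefics3, InternVL) use similar but not identical schemas.

def build_multi_image_messages(slice_paths, question):
    """Build a chat-format message list: N images followed by one text prompt."""
    content = [{"type": "image", "image": p} for p in slice_paths]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Placeholder paths: in practice these come from your evidence-packet generator.
messages = build_multi_image_messages(
    ["slice_040_lung.png", "slice_040_soft.png"],
    "Describe any abnormal findings visible in these axial CT slices.",
)
```

The resulting `messages` list is what you would hand to the model's processor via `apply_chat_template`; keeping this packaging step model-agnostic makes it easy to swap Tier A baselines behind the same interface.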
Tier B — Radiology / medical VLMs (add after you have a baseline)
Medical-tuned VLMs can improve language style and some domain priors, but they also vary widely in training quality and evaluation rigor.
RadFM (radiology foundation model line)
- The RadFM paper frames RadFM as a generalist radiology foundation effort with large-scale 2D/3D data. (Nature)
- There is an HF repo and a GitHub repo referencing model checkpoints. (Hugging Face)
When to use: if you want radiology-oriented priors and are willing to handle research-grade setup.
Tier C — CT / 3D-native research models (closest to “study-level” CT)
If your long-term goal is “CT study understanding” rather than “slice captioning,” these papers/projects are the right background—and in some cases offer usable checkpoints.
Merlin (3D CT VLM)
- Merlin is explicitly a “vision-language foundation model for 3D CT,” trained with CT + reports + diagnosis codes, and evaluated across many tasks. (arXiv)
When to use: as a research reference or if you want to experiment with 3D-native approaches (likely beyond pure PNG-slice chat).
CT-RATE / CT-CLIP / CT-CHAT (chest CT focused)
- CT-RATE introduces a large chest CT dataset paired with reports and describes CT-CLIP and CT-CHAT built on it. (arXiv)
- The CT-CLIP GitHub repo positions CT-CHAT as a 3D chest CT chat model built from CT-CLIP. (GitHub)
- The CT-RATE dataset page contains CT-CHAT description and related assets. (Hugging Face)
- A discussion thread mentions running CT-CHAT via provided scripts and model paths. (Hugging Face)
When to use: if your primary use case is non-contrast chest CT and you want a domain-aligned research baseline.
CT-Agent (agentic framework for CT QA)
- CT-Agent is specifically about handling CTQA by decomposing anatomy and using a global-local token compression strategy, evaluated on CT-RATE and RadGenome-ChestCT. (arXiv)
When to use: as an architectural blueprint (tools + compression + reasoning), even if you don’t adopt it wholesale.
TotalFM (Jan 2026; organ-separated 3D CT foundation direction)
- TotalFM proposes an organ-separated framework for 3D CT foundation modeling and compares against CT-CLIP and Merlin in zero-shot settings. (arXiv)
When to use: as the newest “where research is going” reference for efficient 3D CT VLM design.
CT2Rep / BTB3D (report generation + better 3D tokenization lines)
- CT2Rep targets automated report generation for chest CT volumes. (GitHub)
- BTB3D focuses on improved tokenization for 3D medical VLMs (NeurIPS 2025). (OpenReview)
When to use: if you want to push beyond Q/A into report generation with explicit 3D modeling research.
The part that matters as much as model choice: how you prepare CT slices
1) Intensity correctness (HU + windowing)
Even though the model ultimately receives PNGs, your pipeline should internally work in HU and apply windowing explicitly.
- HU conversion uses Rescale Slope and Rescale Intercept to map stored values to HU. (Stack Overflow)
- DICOM also explicitly clarifies that CT Rescale Type is Hounsfield Units (signed), and windowing behavior matters. (DICOM)
Practical recommendation: always provide multiple windows for the same slice set (e.g., lung + soft tissue ± bone), otherwise your model is blind to key findings by construction.
2) Do not send “all slices”
Most VLMs degrade sharply if you send too many near-duplicate slices.
Better strategies:
- Montage-first: a 4Ă—4 (or 5Ă—5) montage of evenly sampled axial slices per window.
- Then top-k singles: add a handful of high-resolution slices selected by retrieval or heuristics.
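The montage-first step can be sketched in a few lines of numpy, assuming the volume has already been windowed to uint8 as above (grid size and sampling strategy are illustrative choices):

```python
import numpy as np

def make_montage(volume, grid=4):
    """Tile grid*grid evenly sampled axial slices from a windowed uint8
    volume shaped (z, y, x) into one montage image; also return the
    chosen z-indices so answers can cite specific tiles."""
    z = volume.shape[0]
    idx = np.linspace(0, z - 1, grid * grid).round().astype(int)
    tiles = [volume[i] for i in idx]
    rows = [np.hstack(tiles[r * grid:(r + 1) * grid]) for r in range(grid)]
    return np.vstack(rows), idx
```

Returning `idx` alongside the image matters: it is what lets the evidence-cited prompting pattern later map a tile back to a real slice.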
3) Slice selection should be question-aware
If the user asks about PE, hemorrhage, appendicitis, etc., your evidence packet should focus on relevant z-ranges/anatomy.
Two good ways to do that:
- Retrieval using CT-CLIP-style embeddings (text query → relevant slices/regions). (arXiv)
- Tool-based selection using segmentation (organ masks → choose representative slices per organ).
For segmentation, TotalSegmentator is a robust baseline tool for major anatomical structures. (GitHub)
What I would actually try first (a concrete shortlist)
If you want the best chance of success quickly (open models)
- Qwen2.5-VL Instruct (start with 7B; compare up to 32B/72B if possible) (Hugging Face)
- Idefics3-8B-Llama3 (multi-image stability benchmark) (Hugging Face)
- InternVL2.5 (cross-check baseline) (Hugging Face)
- Pixtral-12B-2409 (mid-size quality/latency comparison) (Hugging Face)
- MiniCPM-V 2.6 (or int4) if you need faster iteration / lower VRAM (Hugging Face)
If your focus is chest CT and you want CT-native references
- CT-CLIP / CT-CHAT (domain-aligned 3D chest CT line) (arXiv)
- Merlin (3D CT VLM foundation reference) (arXiv)
- CT-Agent (agentic CT QA blueprint) (arXiv)
- TotalFM (2026) (organ-separated 3D CT foundation direction) (arXiv)
Prompting patterns that work better for CT slice chat
Pattern A: Evidence-cited answering (reduces hallucinations)
Require the model to cite which slice tiles it used.
Example (conceptual):
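A minimal sketch of such a prompt skeleton, assuming montage tiles are labeled by row letter and column number (the labeling scheme and wording are illustrative, not from any specific model's documentation):

```python
# Conceptual system/instruction text for evidence-cited answering.
# Tile IDs like "montage-1, tile C3" are whatever labels your packet
# generator stamps onto the montages.
EVIDENCE_CITED_INSTRUCTIONS = """\
You are given CT slice montages with labeled tiles (rows A-D, columns 1-4).
Answer the question, then:
1. List the tile IDs you used as evidence (e.g. "montage-1, tile C3").
2. If no tile supports a claim, say "not assessable from provided slices".
Do not report findings you cannot tie to a specific tile."""
```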
This simple structure tends to reduce “confident guessing” because it forces the model to ground each claim in specific visual evidence.
Pattern B: Tool-augmented explanations
If you run segmentation or measurements, put them into the prompt as structured text:
- organ volumes
- detected candidate regions
- HU statistics in ROI (if you compute them)
Then ask the model to explain the tool output rather than infer everything from pixels.
This matches the direction of CT-Agent-style pipelines (tools + reasoning). (arXiv)
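One way to sketch the serialization step, with illustrative organ names and values (field names and formatting are assumptions, not a standard schema):

```python
# Sketch: serializing tool outputs (segmentation-derived volumes, ROI HU
# statistics) into structured prompt text the VLM explains rather than
# re-derives from pixels. All keys and numbers here are illustrative.
def format_tool_context(organ_volumes_ml, roi_hu_stats):
    lines = ["TOOL OUTPUT (computed by segmentation tools, not by the model):"]
    for organ, ml in organ_volumes_ml.items():
        lines.append(f"- {organ} volume: {ml:.0f} mL")
    for roi, (mean, std) in roi_hu_stats.items():
        lines.append(f"- {roi} HU: mean {mean:.0f}, std {std:.0f}")
    lines.append("Explain these findings; do not re-estimate them from pixels.")
    return "\n".join(lines)
```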
How to compare models fairly on your platform
Once you have the evidence packet generator, model comparison becomes much easier.
Use a repeatable evaluation harness
- OpenCompass maintains VLMEvalKit, an evaluation toolkit for many multi-modality models and benchmarks. (arXiv)
- They also host an Open VLM leaderboard space (useful for triangulating baseline model strength). (Hugging Face)
Practical recommendation: create a small internal benchmark:
- 30–100 anonymized research cases (or public CT-RATE-derived cases for chest),
- 5–10 question types,
- fixed evidence packet templates,
- score: evidence consistency, omission rate, contradiction rate, and clinician spot-checks.
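A crude per-case scoring sketch for two of the metrics above, treating findings as normalized strings and unsupported findings as a proxy for contradictions (a real harness needs clinician-validated matching rules, so take this only as a starting shape):

```python
# Sketch: per-case scoring over sets of normalized finding strings.
# "unsupported_rate" is a crude proxy for the contradiction rate: it counts
# predicted findings absent from the reference, without semantic matching.
def score_case(predicted, reference):
    pred, ref = set(predicted), set(reference)
    return {
        "omission_rate": len(ref - pred) / len(ref) if ref else 0.0,
        "unsupported_rate": len(pred - ref) / len(pred) if pred else 0.0,
    }
```

Averaging these per-case dicts across your 30–100 cases gives a stable number to track as you swap models or evidence-packet templates.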
Summary recommendation
- Start with general VLM baselines (Qwen2.5-VL + Idefics3 + InternVL2.5) to validate your slice packaging, windowing, and prompting. (Hugging Face)
- Make your CT evidence packets strong (multi-window, montage + top-k singles, question-aware slice selection). HU/windowing correctness is foundational. (Stack Overflow)
- Add CT-native research models as “north stars” (CT-CLIP/CT-CHAT for chest CT; Merlin/TotalFM/CT-Agent as study-level references). (arXiv)
- Track model performance with a consistent harness (VLMEvalKit + your own CT-specific tests). (arXiv)