There’s no especially smart method I can recommend… The closest thing is to use a backend that already supports this kind of caching, like vLLM. If you are using Transformers, read on:
Recommendation
For Qwen3.5 in Transformers, the recommended way is not to look for a public image_embeds= argument on generate(). The current public API for Qwen3_5ForConditionalGeneration exposes pixel_values, image_grid_thw, mm_token_type_ids, and the generic inputs_embeds, but not a first-class visual-embedding input. The documented example is still the normal processor.apply_chat_template(...) → model.generate(**inputs) path. (Hugging Face)
So the practical recommendation is:
- Precompute at the model’s visual-feature boundary: the output of get_image_features(...) or its Qwen3.5 equivalent.
- Cache the visual features together with image_grid_thw and the exact model/processor revision.
- Later, inject those cached visual features through a thin wrapper/subclass that reproduces the stock multimodal fusion path, instead of trying to pass a bare image-embedding tensor directly to generate(). (Hugging Face)
Why this is the right abstraction
inputs_embeds in the public forward signature is a generic token-embedding escape hatch: it means “I already have the full sequence embeddings for the model input.” It is not documented as “visual embeddings go here.” By contrast, the multimodal API explicitly documents pixel_values, image_grid_thw, and mm_token_type_ids, which tells you that Qwen3.5 expects a structured multimodal input contract, not just an opaque tensor. (Hugging Face)
That contract matters because image_grid_thw is part of how the model lays the image features out in the language-model space. The Qwen3.5 docs describe image_grid_thw as the temporal, height, and width dimensions of each image’s feature shape in the LLM, and vLLM’s multimodal input docs state explicitly that image_grid_thw is needed to calculate the positional encoding for Qwen-family image-embedding inputs. (Hugging Face)
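As a concrete illustration (the grid values and the 2x2 merge factor below are assumptions for a typical Qwen-family setup, not read from the Qwen3.5 config), image_grid_thw is just a small integer tensor per image, and it determines how many visual-token positions the prompt must reserve:

import torch

# Hypothetical grid for one static image: 1 temporal step, 34 x 46 vision patches.
image_grid_thw = torch.tensor([[1, 34, 46]])
t, h, w = image_grid_thw[0].tolist()

# Assumption: 2x2 spatial merging of patches before the language model.
spatial_merge_size = 2
num_visual_tokens = (t * h * w) // (spatial_merge_size ** 2)
print(num_visual_tokens)  # how many image placeholder positions this image occupies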
What you should cache
For Qwen3.5 specifically, the safest cache object is:
- the LM-ready visual feature tensor(s) produced at the visual-feature boundary,
- image_grid_thw,
- model ID + exact revision,
- processor ID + exact revision,
- any processor settings that affect visual tokenization/resolution. (Hugging Face)
That is a better boundary than caching raw pixel_values, because pixel_values is still the input to the expensive visual path. It is also a better boundary than caching the final full-sequence inputs_embeds, because full-sequence embeddings are tied much more tightly to one exact prompt layout and one exact generation-preparation path. (Hugging Face)
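A minimal sketch of such a cache record, assuming ad-hoc field names rather than any official schema (the two tensors below are placeholders for real processor and get_image_features(...) outputs):

import torch

# Placeholder tensors; real values come from the processor and the model's
# visual-feature boundary. Shapes here are invented for illustration.
image_grid_thw = torch.tensor([[1, 34, 46]])
visual_features = torch.randn(391, 2048)

cache = {
    "model_id": "Qwen/Qwen3.5-0.8B",      # record the exact revision you loaded as well
    "processor_id": "Qwen/Qwen3.5-0.8B",
    "processor_settings": {"min_pixels": 256 * 28 * 28, "max_pixels": 512 * 28 * 28},
    "image_grid_thw": image_grid_thw,
    "visual_outputs": visual_features,     # LM-ready features, not raw pixel_values
}
torch.save(cache, "qwen35_visual_cache.pt")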
Why I would not make raw inputs_embeds your main interface
People do try this, but it is brittle. A Hugging Face issue on Qwen2-VL shows users manually constructing inputs_embeds for multimodal generation and hitting regressions across versions, and another issue shows generic inputs_embeds + cache/past-key-value generation problems. Those are not proofs that the approach is impossible, but they are a strong signal that inputs_embeds is the low-level plumbing, not the most stable public abstraction to build around. (GitHub)
There is also an architectural reason to be cautious: recent Transformers release notes note that 3D position IDs for vision-language models were unified under a common interface, which means code that manually reconstructs multimodal positions is exactly the kind of code that can get broken by framework changes. (GitHub)
The best pattern inside Transformers
Inside Transformers, I would treat this as a small adapter layer:
- Build the prompt normally with apply_chat_template(...).
- Keep input_ids, attention_mask, mm_token_type_ids, image_grid_thw.
- Replace the model’s visual-feature computation step with your cached features.
- Let the rest of the stock multimodal forward/generation path continue unchanged. (Hugging Face)
Conceptually, the adapter looks like this:
cache = build_visual_cache(image) # precompute once
out = generate_from_visual_cache(cache, prompt)
Internally, that adapter may use inputs_embeds, but the caller should not have to think in terms of raw sequence embeddings.
The best pattern if you want first-class support
If you want a public, supported API for image embeddings rather than a local wrapper, vLLM is ahead of Transformers here. Its multimodal input docs explicitly support image embedding inputs and require the extra Qwen metadata, including image_grid_thw for positional encoding. For Qwen3-VL, vLLM goes further and states that image_embeds should contain both the base image embedding and DeepStack features. (vLLM)
That is a useful design clue even for Qwen3.5: the right object is usually not “a naked image embedding tensor,” but a typed multimodal visual package plus required metadata.
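For orientation, a hedged sketch of that vLLM surface is below. The multi_modal_data structure with image_embeds and image_grid_thw keys follows vLLM’s documented embedding-input format for Qwen-family models, but the accepted fields, the placeholder tokens, and the example model ID vary by model and vLLM version, so verify them against the docs for your install; the tensors here are dummies standing in for a real precomputed cache:

import torch
from vllm import LLM

# Dummy stand-ins for embeddings precomputed offline with the matching model's vision tower.
image_embeds = torch.randn(391, 2048)
image_grid_thw = torch.tensor([[1, 34, 46]])

# Any Qwen VLM that vLLM lists as accepting image-embedding inputs.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", limit_mm_per_prompt={"image": 1})

outputs = llm.generate({
    # The prompt must still contain the model's image placeholder tokens.
    "prompt": "<|vision_start|><|image_pad|><|vision_end|>Describe the image.",
    "multi_modal_data": {
        "image": {
            "image_embeds": image_embeds,        # precomputed visual embeddings
            "image_grid_thw": image_grid_thw,    # required to compute positional encoding
        }
    },
})
print(outputs[0].outputs[0].text)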
Important distinction: Qwen3.5 vs Qwen3-VL
For Qwen3.5, the public docs expose the multimodal forward inputs but do not document DeepStack outputs on the model page. For Qwen3-VL, the docs explicitly document get_image_features(...), deepstack_features, and the multimodal forward signature. So:
- for Qwen3.5, cache the visual features at the visual-feature boundary plus image_grid_thw;
- for Qwen3-VL, cache both the base visual outputs and deepstack_features, plus image_grid_thw (see the payload sketch below). (Hugging Face)
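A compact way to see the difference is in the cache payloads themselves. This is only a shape sketch: the tensors are dummies, and representing deepstack_features as a list of per-level tensors is an assumption about its structure, since the real object is whatever Qwen3-VL’s get_image_features(...) returns:

import torch

# Dummy tensors; real values come from the respective model's visual-feature boundary.
base_visual_features = torch.randn(391, 2048)
deepstack_features = [torch.randn(391, 2048) for _ in range(3)]  # assumed per-level list
image_grid_thw = torch.tensor([[1, 34, 46]])

qwen35_payload = {
    "visual_outputs": base_visual_features,
    "image_grid_thw": image_grid_thw,
}
qwen3_vl_payload = {
    **qwen35_payload,
    "deepstack_features": deepstack_features,  # the extra piece Qwen3-VL needs
}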
Clear answer to your question
If your question is:
“What is the recommended way of providing precomputed image_embeds to Qwen3.5?”
Then the answer is:
- There is no documented first-class image_embeds= entry point on generate() for Qwen3.5 in Transformers today. (Hugging Face)
- The recommended engineering pattern is to cache the model’s visual features plus image_grid_thw, then inject them through a thin wrapper/subclass that preserves the stock multimodal path. (Hugging Face)
- If you want a first-class embedding-input API instead of a wrapper, use a runtime that already exposes it, such as vLLM. (vLLM)
The shortest way to think about it is:
Do not treat the cache as “image_embeds.pt”.
Treat it as a reusable multimodal visual package for a specific Qwen-family model revision. (Hugging Face)
Qwen3.5-specific template
The template below hard-codes the correct patch point as model.model.get_image_features, because the current Qwen3.5 conditional-generation stack routes multimodal feature extraction through the inner Qwen3_5Model, not the outer wrapper. The public Qwen3.5 docs also show that the supported multimodal inputs are pixel_values, image_grid_thw, and mm_token_type_ids, not a first-class image_embeds= argument. Recent Transformers releases also changed VLM 3D position handling, so keeping the stock path intact is the safer pattern. (Hugging Face)
The code below validates the intended approach in a strict way:
- baseline run from the real image,
- cache the output of get_image_features(...),
- cached run with the inner feature method patched,
- trap the real vision tower so the run fails if visual recomputation happens. (Hugging Face)
# Compact Qwen3.5 best-practice template:
# precompute visual features once, cache them, then reuse them later for generation.
#
# Why this version avoids a crash:
# - For Qwen3.5, patch model.model.get_image_features(...), not the outer model.
# - The cached run also traps model.model.visual.forward(...), so it will fail
# if the real vision tower is called by mistake.
#
# References:
# - Qwen3.5 docs:
# https://huggingface.co/docs/transformers/model_doc/qwen3_5
# - Transformers releases (3D position-id / generation-path changes):
# https://github.com/huggingface/transformers/releases
# - Sample image:
# https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg
#
# deps:
# pip install -U "torch>=2.3" "transformers>=5.3.0" accelerate pillow
#
# Notes:
# - CUDA: prefers bfloat16 if supported, else float16
# - CPU: uses float32
# - No argparse
# - Low-memory friendly: uses a small public checkpoint and caps image pixels if supported
# - This validates the pattern; it is not an official image_embeds= API
import gc
import types
from pathlib import Path
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
# ----------------------------
# User settings
# ----------------------------
MODEL_ID = "Qwen/Qwen3.5-0.8B"
IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
CACHE_PATH = Path("qwen35_visual_cache.pt")
PROMPT = "Describe the image clearly in 2 short sentences."
MAX_NEW_TOKENS = 64
# ----------------------------
# Helpers
# ----------------------------
class AttrDict(dict):
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
def pick_device_and_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    else:
        device = torch.device("cpu")
        dtype = torch.float32
    return device, dtype
def maybe_make_processor(model_id: str):
    # Reduce visual tokens on small RAM / VRAM setups if supported by this processor version.
    try:
        return AutoProcessor.from_pretrained(
            model_id,
            min_pixels=256 * 28 * 28,
            max_pixels=512 * 28 * 28,
        )
    except TypeError:
        return AutoProcessor.from_pretrained(model_id)
def move_batch_to_device(batch, device):
    out = {}
    for k, v in batch.items():
        out[k] = v.to(device) if torch.is_tensor(v) else v
    return out
def cpu_clone_tree(x):
    if x is None:
        return None
    if torch.is_tensor(x):
        return x.detach().cpu().contiguous()
    if isinstance(x, list):
        return [cpu_clone_tree(v) for v in x]
    if isinstance(x, tuple):
        return tuple(cpu_clone_tree(v) for v in x)
    if isinstance(x, dict):
        return {k: cpu_clone_tree(v) for k, v in x.items()}
    return x
def runtime_cast_tree(x, device, float_dtype):
    if x is None:
        return None
    if torch.is_tensor(x):
        y = x.to(device)
        if torch.is_floating_point(y):
            y = y.to(float_dtype)
        return y
    if isinstance(x, list):
        return [runtime_cast_tree(v, device, float_dtype) for v in x]
    if isinstance(x, tuple):
        return tuple(runtime_cast_tree(v, device, float_dtype) for v in x)
    if isinstance(x, dict):
        return {k: runtime_cast_tree(v, device, float_dtype) for k, v in x.items()}
    return x
def total_nbytes(x):
    if x is None:
        return 0
    if torch.is_tensor(x):
        return x.numel() * x.element_size()
    if isinstance(x, (list, tuple)):
        return sum(total_nbytes(v) for v in x)
    if isinstance(x, dict):
        return sum(total_nbytes(v) for v in x.values())
    return 0
def format_bytes(n):
    units = ["B", "KB", "MB", "GB", "TB"]
    n = float(n)
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n /= 1024.0
        i += 1
    return f"{n:.2f} {units[i]}"
def build_messages(prompt_text, image_url):
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]
def build_inputs(processor, messages):
    batch = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    batch.pop("token_type_ids", None)
    return batch
def decode_new_tokens(processor, prompt_input_ids, generated_ids):
    trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(prompt_input_ids, generated_ids)
    ]
    return processor.batch_decode(
        trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )[0].strip()
def run_generate(model, processor, inputs, label):
    with torch.inference_mode():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=False,
            use_cache=True,
        )
    text = decode_new_tokens(processor, inputs["input_ids"], generated_ids)
    print(f"\n[{label}]")
    print(text)
    return text
# ----------------------------
# Qwen3.5-specific cache boundary
# ----------------------------
def make_visual_cache(model, processor_inputs, cache_path: Path):
    """
    Qwen3.5-specific:
    the multimodal feature extraction lives on model.model.get_image_features(...).
    """
    owner = model.model
    with torch.inference_mode():
        image_outputs = owner.get_image_features(
            pixel_values=processor_inputs["pixel_values"],
            image_grid_thw=processor_inputs["image_grid_thw"],
            return_dict=True,
        )
    cache = {
        "model_id": MODEL_ID,
        "image_grid_thw": cpu_clone_tree(processor_inputs["image_grid_thw"]),
        "prompt_skeleton": {
            "input_ids": cpu_clone_tree(processor_inputs["input_ids"]),
            "attention_mask": cpu_clone_tree(processor_inputs["attention_mask"]),
            "mm_token_type_ids": cpu_clone_tree(processor_inputs.get("mm_token_type_ids")),
        },
        # Keep the full visual output object as a plain dict.
        "visual_outputs": cpu_clone_tree(dict(image_outputs)),
    }
    torch.save(cache, cache_path)
    print("\n[cache stats]")
    print("pixel_values bytes :", format_bytes(total_nbytes(processor_inputs["pixel_values"])))
    print("cached visual bytes:", format_bytes(total_nbytes(cache["visual_outputs"])))
    print("cache file         :", str(cache_path.resolve()))
    return cache
# ----------------------------
# Qwen3.5-specific patching
# ----------------------------
def install_cached_visual_patch(model, cache, device, float_dtype):
    """
    Patch the INNER Qwen3.5 model, not the outer generation wrapper.
    Also trap the real vision tower to prove cached reuse is actually happening.
    """
    owner = model.model  # <-- this is the important fix
    original_get_image_features = owner.get_image_features
    original_visual_forward = owner.visual.forward
    patch_state = {
        "patched_calls": 0,
        "real_visual_calls": 0,
    }
    def patched_get_image_features(self, pixel_values=None, image_grid_thw=None, **kwargs):
        patch_state["patched_calls"] += 1
        return AttrDict(runtime_cast_tree(cache["visual_outputs"], device, float_dtype))
    def trapped_visual_forward(self, *args, **kwargs):
        patch_state["real_visual_calls"] += 1
        raise RuntimeError(
            "Real Qwen3.5 vision tower was called during cached run. "
            "The cached path did not bypass visual recomputation."
        )
    owner.get_image_features = types.MethodType(patched_get_image_features, owner)
    owner.visual.forward = types.MethodType(trapped_visual_forward, owner.visual)
    return owner, original_get_image_features, original_visual_forward, patch_state
def restore_cached_visual_patch(owner, original_get_image_features, original_visual_forward):
    owner.get_image_features = original_get_image_features
    owner.visual.forward = original_visual_forward
def make_cached_inputs(cache, device, float_dtype):
    """
    Keep the multimodal branch active with a tiny non-empty sentinel pixel_values tensor.
    The patched model.model.get_image_features(...) ignores it completely.
    """
    prompt = cache["prompt_skeleton"]
    out = {
        "input_ids": prompt["input_ids"].to(device),
        "attention_mask": prompt["attention_mask"].to(device),
        "image_grid_thw": cache["image_grid_thw"].to(device),
        "pixel_values": torch.zeros((1,), device=device, dtype=float_dtype),
    }
    if prompt.get("mm_token_type_ids") is not None:
        out["mm_token_type_ids"] = prompt["mm_token_type_ids"].to(device)
    return out
# ----------------------------
# Main
# ----------------------------
def main():
    device, dtype = pick_device_and_dtype()
    print("[runtime]")
    print("device:", device)
    print("dtype :", dtype)
    processor = maybe_make_processor(MODEL_ID)
    model = Qwen3_5ForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=dtype,
        low_cpu_mem_usage=True,
        attn_implementation="sdpa",
    ).to(device)
    model.eval()
    # 1) Baseline run
    messages = build_messages(PROMPT, IMAGE_URL)
    inputs = move_batch_to_device(build_inputs(processor, messages), device)
    print("\n[input keys]")
    print(sorted(inputs.keys()))
    print("image_grid_thw:", inputs["image_grid_thw"].tolist())
    baseline_text = run_generate(model, processor, inputs, "baseline / normal image path")
    # 2) Build cache at the visual-feature boundary
    _ = make_visual_cache(model, inputs, CACHE_PATH)
    # Simulate "later"
    if "pixel_values" in inputs:
        del inputs["pixel_values"]
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    # 3) Cached run
    cache = torch.load(CACHE_PATH, map_location="cpu")
    if cache["model_id"] != MODEL_ID:
        raise ValueError(f"Cache was created for {cache['model_id']}, current model is {MODEL_ID}.")
    owner, original_get_image_features, original_visual_forward, patch_state = install_cached_visual_patch(
        model=model,
        cache=cache,
        device=device,
        float_dtype=dtype,
    )
    try:
        cached_inputs = make_cached_inputs(cache, device, dtype)
        cached_text = run_generate(model, processor, cached_inputs, "cached visual path")
    finally:
        restore_cached_visual_patch(owner, original_get_image_features, original_visual_forward)
    # 4) Validation
    print("\n[validation]")
    print("patched get_image_features calls:", patch_state["patched_calls"])
    print("real visual.forward calls       :", patch_state["real_visual_calls"])
    print("baseline == cached              :", baseline_text == cached_text)
    print("cache file                      :", str(CACHE_PATH.resolve()))
    if patch_state["patched_calls"] < 1:
        raise RuntimeError("Patched model.model.get_image_features was never called.")
    if patch_state["real_visual_calls"] != 0:
        raise RuntimeError("The real Qwen3.5 vision tower ran during the cached path.")
    print("\n[result]")
    if baseline_text == cached_text:
        print("Success: cached visual features reproduced the same output.")
    else:
        print("Cached path succeeded and bypassed visual recomputation.")
        print("Text differs from baseline, which can still happen across kernels/dtypes/versions.")
if __name__ == "__main__":
    main()