An error occurred after modifying the patch size of Qwen-Image-Edit

After changing the patch_size parameter in qwen-image-edit-2509/processor/preprocessor_config.json from 14 to 7, I can train normally, but this error occurs when I try to run inference: /pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [450,0,0], thread: [0,0,0] Assertion `ind >= 0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds"` failed.

Below is the traceback:
AcceleratorError Traceback (most recent call last)
Cell In[3], line 26
24 image = Image.open(f"{image_name}.png").convert("RGB")
25 width, height = image.size
—> 26 images_out = pipe(image, prompt, negative_prompt="脸部出现红晕", num_inference_steps=15, output_type='pil', true_cfg_scale=4.0).images
27 save_image = images_out[0].resize((width, height))
28 save_image.save(save_image_name)

File ~/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py:120, in context_decorator.<locals>.decorate_context(*args, **kwargs)
117 @functools.wraps(func)
118 def decorate_context(*args, **kwargs):
119 with ctx_factory():
→ 120 return func(*args, **kwargs)

File ~/miniconda3/lib/python3.12/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage_edit_plus.py:700, in QwenImageEditPlusPipeline.__call__(self, image, prompt, negative_prompt, true_cfg_scale, height, width, num_inference_steps, sigmas, guidance_scale, num_images_per_prompt, generator, latents, prompt_embeds, prompt_embeds_mask, negative_prompt_embeds, negative_prompt_embeds_mask, output_type, return_dict, attention_kwargs, callback_on_step_end, callback_on_step_end_tensor_inputs, max_sequence_length)
695 logger.warning(
696 " negative_prompt is passed but classifier-free guidance is not enabled since true_cfg_scale <= 1"
697 )
699 do_true_cfg = true_cfg_scale > 1 and has_neg_prompt
→ 700 prompt_embeds, prompt_embeds_mask = self.encode_prompt(
701 image=condition_images,
702 prompt=prompt,
703 prompt_embeds=prompt_embeds,
704 prompt_embeds_mask=prompt_embeds_mask,
705 device=device,
706 num_images_per_prompt=num_images_per_prompt,
707 max_sequence_length=max_sequence_length,
708 )
709 if do_true_cfg:
710 negative_prompt_embeds, negative_prompt_embeds_mask = self.encode_prompt(
711 image=condition_images,
712 prompt=negative_prompt,
(…) 717 max_sequence_length=max_sequence_length,
718 )

File ~/miniconda3/lib/python3.12/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage_edit_plus.py:318, in QwenImageEditPlusPipeline.encode_prompt(self, prompt, image, device, num_images_per_prompt, prompt_embeds, prompt_embeds_mask, max_sequence_length)
315 batch_size = len(prompt) if prompt_embeds is None else prompt_embeds.shape[0]
317 if prompt_embeds is None:
→ 318 prompt_embeds, prompt_embeds_mask = self._get_qwen_prompt_embeds(prompt, image, device)
320 _, seq_len, _ = prompt_embeds.shape
321 prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)

File ~/miniconda3/lib/python3.12/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage_edit_plus.py:262, in QwenImageEditPlusPipeline._get_qwen_prompt_embeds(self, prompt, image, device, dtype)
253 txt = [template.format(base_img_prompt + e) for e in prompt]
255 model_inputs = self.processor(
256 text=txt,
257 images=image,
258 padding=True,
259 return_tensors="pt",
260 ).to(device)
→ 262 outputs = self.text_encoder(
263 input_ids=model_inputs.input_ids,
264 attention_mask=model_inputs.attention_mask,
265 pixel_values=model_inputs.pixel_values,
266 image_grid_thw=model_inputs.image_grid_thw,
267 output_hidden_states=True,
268 )
270 hidden_states = outputs.hidden_states[-1]
271 split_hidden_states = self._extract_masked_hidden(hidden_states, model_inputs.attention_mask)

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1775, in Module._wrapped_call_impl(self, *args, **kwargs)
1773 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1774 else:
→ 1775 return self._call_impl(*args, **kwargs)

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1786, in Module._call_impl(self, *args, **kwargs)
1781 # If we don’t have any hooks, we want to skip the rest of the logic in
1782 # this function, and just call forward.
1783 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1784 or _global_backward_pre_hooks or _global_backward_hooks
1785 or _global_forward_hooks or _global_forward_pre_hooks):
→ 1786 return forward_call(*args, **kwargs)
1788 result = None
1789 called_always_called_hooks = set()

File ~/miniconda3/lib/python3.12/site-packages/accelerate/hooks.py:175, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
173 output = module._old_forward(*args, **kwargs)
174 else:
→ 175 output = module._old_forward(*args, **kwargs)
176 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/lib/python3.12/site-packages/transformers/utils/generic.py:959, in can_return_tuple.<locals>.wrapper(self, *args, **kwargs)
957 if return_dict_passed is not None:
958 return_dict = return_dict_passed
→ 959 output = func(self, *args, **kwargs)
960 if not return_dict and not isinstance(output, tuple):
961 output = output.to_tuple()

File ~/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:1493, in Qwen2_5_VLForConditionalGeneration.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, rope_deltas, cache_position, second_per_grid_ts, logits_to_keep, **kwargs)
1488 output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1489 output_hidden_states = (
1490 output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1491 )
→ 1493 outputs = self.model(
1494 input_ids=input_ids,
1495 pixel_values=pixel_values,
1496 pixel_values_videos=pixel_values_videos,
1497 image_grid_thw=image_grid_thw,
1498 video_grid_thw=video_grid_thw,
1499 second_per_grid_ts=second_per_grid_ts,
1500 position_ids=position_ids,
1501 attention_mask=attention_mask,
1502 past_key_values=past_key_values,
1503 inputs_embeds=inputs_embeds,
1504 use_cache=use_cache,
1505 output_attentions=output_attentions,
1506 output_hidden_states=output_hidden_states,
1507 return_dict=True,
1508 cache_position=cache_position,
1509 **kwargs,
1510 )
1512 hidden_states = outputs[0]
1514 # Only compute necessary logits, and do not upcast them to float if we are not computing the loss

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1775, in Module._wrapped_call_impl(self, *args, **kwargs)
1773 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1774 else:
→ 1775 return self._call_impl(*args, **kwargs)

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1786, in Module._call_impl(self, *args, **kwargs)
1781 # If we don’t have any hooks, we want to skip the rest of the logic in
1782 # this function, and just call forward.
1783 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1784 or _global_backward_pre_hooks or _global_backward_hooks
1785 or _global_forward_hooks or _global_forward_pre_hooks):
→ 1786 return forward_call(*args, **kwargs)
1788 result = None
1789 called_always_called_hooks = set()

File ~/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:1275, in Qwen2_5_VLModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, rope_deltas, cache_position, second_per_grid_ts, **kwargs)
1272 inputs_embeds = self.get_input_embeddings()(input_ids)
1274 if pixel_values is not None:
→ 1275 image_embeds = self.get_image_features(pixel_values, image_grid_thw)
1276 image_embeds = torch.cat(image_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
1277 image_mask, _ = self.get_placeholder_mask(
1278 input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
1279 )

File ~/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:1188, in Qwen2_5_VLModel.get_image_features(self, pixel_values, image_grid_thw)
1178 """
1179 Encodes images into continuous embeddings that can be forwarded to the language model.
1180
(…) 1185 The temporal, height and width of feature shape of each image in LLM.
1186 """
1187 pixel_values = pixel_values.type(self.visual.dtype)
→ 1188 image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
1189 split_sizes = (image_grid_thw.prod(-1) // self.visual.spatial_merge_size**2).tolist()
1190 image_embeds = torch.split(image_embeds, split_sizes)

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1775, in Module._wrapped_call_impl(self, *args, **kwargs)
1773 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1774 else:
→ 1775 return self._call_impl(*args, **kwargs)

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1786, in Module._call_impl(self, *args, **kwargs)
1781 # If we don’t have any hooks, we want to skip the rest of the logic in
1782 # this function, and just call forward.
1783 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1784 or _global_backward_pre_hooks or _global_backward_hooks
1785 or _global_forward_hooks or _global_forward_pre_hooks):
→ 1786 return forward_call(*args, **kwargs)
1788 result = None
1789 called_always_called_hooks = set()

File ~/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:459, in Qwen2_5_VisionTransformerPretrainedModel.forward(self, hidden_states, grid_thw, **kwargs)
457 hidden_states = hidden_states.reshape(seq_len, -1)
458 rotary_pos_emb = rotary_pos_emb.reshape(seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1)
→ 459 rotary_pos_emb = rotary_pos_emb[window_index, :, :]
460 rotary_pos_emb = rotary_pos_emb.reshape(seq_len, -1)
461 emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)

AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Below is my inference code:
pipe = QwenImageEditPlusPipeline.from_pretrained("autodl-tmp/qwen-image-edit-2509", torch_dtype=torch.bfloat16, height=512, width=512, output_type='np', true_cfg_scale=4.0)
#pipe.to("cuda:0")
pipe.load_lora_weights("qwen-image-finetune/output/all/qwen_image_finetune/v1/checkpoint-last-0-120-last")
pipe.enable_model_cpu_offload()

pipe(image, prompt, negative_prompt="", num_inference_steps=15, output_type='pil', true_cfg_scale=4.0).images


It’s probably best to avoid that change if possible…


The short version:

  • The error is caused by an internal shape/index mismatch that you introduced by changing patch_size from 14 to 7 only in processor/preprocessor_config.json.
  • In Qwen-Image-Edit / Qwen2.5-VL, patch_size is a structural property of the vision encoder, not a free hyperparameter you can tweak just in the processor.
  • Training can appear to “work” while inference crashes, because not all code paths and image shapes were exercised during training.
  • The only realistic fix with current public models is to restore patch_size=14, use the official configuration, and control detail via resolution (min_pixels / max_pixels, image size) instead of changing patch size.

Below is a detailed but step-by-step explanation.


1. What patch_size actually does in Qwen2.5-VL / Qwen-Image-Edit

1.1. Vision side architecture

Qwen-Image-Edit-2509 uses a Qwen2.5-VL-style vision encoder under the hood. The vision encoder:

  • Takes an image.

  • Splits it into patches of size patch_size × patch_size pixels.

  • Applies a temporal patching factor (temporal_patch_size, default 2) and a spatial merge (merge_size, default 2).

  • Produces:

    • Visual tokens (patch embeddings).
    • A grid descriptor image_grid_thw with shape [batch, 3] = (T, H, W) patch counts.

From the official Qwen2-VL / Qwen2.5-VL docs and code:

  • The processor has:

    • patch_size (default 14),
    • temporal_patch_size (default 2),
    • merge_size (default 2). (Hugging Face)
  • The vision encoder itself is designed around patch_size=14 and merges 2×2 neighboring patches, so each final token roughly corresponds to a 28×28 region. (Zenn)

The Hugging Face processor docs describe patch_size explicitly as:

“The spatial patch size of the vision encoder.” (Hugging Face)

So patch_size isn’t just a resize knob. It must match how the vision backbone actually embeds patches.
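As a back-of-the-envelope illustration of that contract (a simplified sketch only; the real processor also applies "smart resize" rounding and temporal padding):

patch_size, merge_size = 14, 2
height, width = 504, 504                  # 512 rounded to a multiple of 28

grid_t = 1                                                   # a single still image
grid_h, grid_w = height // patch_size, width // patch_size   # image_grid_thw ≈ (1, 36, 36)
raw_patches = grid_t * grid_h * grid_w                       # 1296 patch embeddings
merged_tokens = raw_patches // merge_size**2                 # 324 visual tokens reach the LLM

print(grid_t, grid_h, grid_w, raw_patches, merged_tokens)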

1.2. How that connects to your traceback

Inside Qwen2_5_VLModel:

  1. The processor outputs:

    • pixel_values: image tensor.
    • image_grid_thw: [T, H, W] patch counts.
  2. Qwen2_5_VLModel.get_image_features calls the vision tower:

    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
    
  3. Inside Qwen2_5_VisionTransformerPretrainedModel.forward, it:

    • Uses grid_thw to build a window index (window_index).

    • Reshapes hidden states and rotary embeddings into windows.

    • Indexes them:

      hidden_states = hidden_states.reshape(seq_len // self.spatial_merge_unit,
                                            self.spatial_merge_unit, -1)
      hidden_states = hidden_states[window_index, :, :]
      
      rotary_pos_emb = rotary_pos_emb.reshape(seq_len // self.spatial_merge_unit,
                                              self.spatial_merge_unit, -1)
      rotary_pos_emb = rotary_pos_emb[window_index, :, :]  # <-- your crash
      

If window_index contains any index ≥ the number of windows, you get exactly:

vectorized_gather_kernel: Assertion ind >= 0 && ind < ind_dim_size

This is what you see in:

Qwen2_5_VisionTransformerPretrainedModel.forward
→ rotary_pos_emb = rotary_pos_emb[window_index, :, :]
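The assertion itself is just an out-of-bounds gather on the GPU. A minimal, self-contained illustration of the same failure class (purely illustrative, not Qwen code):

import torch

windows = torch.randn(4, 4, 64)             # pretend: 4 windows of merged tokens
window_index = torch.tensor([0, 1, 2, 5])   # 5 points past the last window

windows[window_index, :, :]
# On CPU: IndexError: index 5 is out of bounds for dimension 0 with size 4
# On CUDA: the same condition is reported asynchronously as the
# "vectorized gather kernel index out of bounds" device-side assert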

2. What your change (14 → 7) actually did

You changed:

"patch_size": 14  →  "patch_size": 7

only in qwen-image-edit-2509/processor/preprocessor_config.json.

That means:

  • The processor now thinks the vision encoder uses 7×7 patches.

  • It computes image_grid_thw assuming a finer grid:

    • For the same image, H and W roughly double (so 4× more patches).
  • However, the vision encoder weights and config remain those of the original Qwen2.5-VL vision tower:

    • All convolutional patch embeddings are still defined for 14×14 patches.
    • The internal merge/window logic (including spatial_merge_unit) is still tuned for patch_size=14, temporal_patch_size=2, merge_size=2.

So now, at runtime:

  • grid_thw encodes “how tokens would be laid out if patch_size=7”.
  • hidden_states encodes “how tokens are actually laid out with patch_size=14 and merging”.
  • The model combines them to compute window_index, which becomes inconsistent with the real sequence length.

The result: some window_index values exceed the valid range, and the GPU kernel asserts.
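You can quantify the disagreement with simple arithmetic (illustrative numbers for a 504×504 input, ignoring smart-resize details):

h = w = 504

grid_7  = (h // 7)  * (w // 7)    # what the edited processor reports: 72 * 72 = 5184 patches
grid_14 = (h // 14) * (w // 14)   # what the 14-pixel vision tower actually yields: 36 * 36 = 1296

print(grid_7, grid_14)
# window_index is built from the larger (patch_size=7) grid, while the hidden states
# only cover the smaller (patch_size=14) layout, so some indices fall out of range.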

This is not hypothetical: the Qwen-Image team has an open GitHub issue that is literally your setup:

“Change qwen-image-edit-2509/processor/preprocessor_config.json.patch_size from 14 to 7. Finetuning succeeds, but inference crashes with
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel ... Assertion ind >= 0 && ind < ind_dim_size.” (GitHub)

So your error is a known consequence of that config change.


3. Why training can “work” while inference crashes

It feels strange that training runs but inference dies. A few realistic reasons:

  1. Different code paths are exercised

    Many fine-tuning recipes for Qwen-Image-Edit:

    • Freeze the Qwen2.5-VL vision tower.
    • Or precompute image/text embeddings.
    • Or don’t fully run the same encode_prompt(...) path used in the diffusers inference pipeline.

    That means the exact combination of shapes (image size, grid_thw, batch size) that triggers the out-of-bounds gather may never appear during your training run, but appears during inference.

  2. Resolution differences

    Changing patch_size changes how many patches there are for a given image resolution.

    • Training may use one fixed resolution or random crop.
    • Inference may use another (your call uses height=512, width=512 on the pipeline constructor).

    Some resolutions may “accidentally” avoid an invalid window_index; others hit it.

  3. Even if training doesn’t crash, it is not well-defined

    Because the processor and vision encoder disagree about patch layout:

    • The image features that get passed into the text encoder are misaligned with the positions the model expects.
    • LoRA learns to compensate for a broken representation, not a cleanly defined “patch_size=7 Qwen2.5-VL”.

So “training runs” does not mean the setup is correct; it just means you didn’t hit a hard runtime failure in that particular loop.


4. Why changing patch_size is not supported for this model

Several upstream references make it clear that patch_size is part of the frozen architecture for each model family:

  • Qwen2-VL / Qwen2.5-VL docs: patch size is a config field of the vision encoder, default 14, with temporal_patch_size=2 and merge_size=2. (Hugging Face)

  • Qwen2-VL design notes and code walkthroughs: image patches are 14×14, then contiguous 2×2 tokens are merged, so effectively each final visual token covers about 28×28 pixels. (arXiv)

  • The Qwen3-VL repo and utilities explicitly define model-specific patch sizes:

    • image_patch_size = 14 for Qwen2.5-VL.
    • image_patch_size = 16 for Qwen3-VL. (GitHub)

That is, each model checkpoint is trained for a particular patch size and derived token layout.

On the diffusion side of Qwen-Image-Edit there is another patch size: the DiT output layer uses:

self.proj_out = nn.Linear(inner_dim, patch_size * patch_size * out_channels, bias=True)

in the official code. (Hugging Face)

So:

  • The visual backbone (Qwen2.5-VL encoder) expects patch_size=14.
  • The diffusion backbone (QwenImageTransformer2DModel) has its own patch_size (2 in the released checkpoints).
  • The processor ties all of this together.

Changing only the processor’s patch_size without modifying and re-training all corresponding backbones produces exactly the inconsistencies you are seeing.
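A small sketch of why the proj_out line above matters (illustrative values only; inner_dim and out_channels are made up for this example, 2 is the released DiT patch size mentioned above):

import torch.nn as nn

inner_dim, out_channels = 3072, 16                           # illustrative, not the released config

proj_released = nn.Linear(inner_dim, 2 * 2 * out_channels)   # patch_size = 2
proj_modified = nn.Linear(inner_dim, 4 * 4 * out_channels)   # any other patch_size changes out_features

print(proj_released.weight.shape, proj_modified.weight.shape)
# The checkpoint's proj_out weights only fit one of these shapes, so a different
# patch size cannot simply be loaded on top of the released weights.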


5. Concrete solutions

5.1. Recommended: revert patch_size to 14 and keep the official architecture

For Qwen-Image-Edit-2509, the only configuration that is guaranteed to work with the released weights is:

"patch_size": 14,
"temporal_patch_size": 2,
"merge_size": 2

in processor/preprocessor_config.json. (Hugging Face)

Steps:

  1. Restore the original preprocessor_config.json:

    "patch_size": 14
    
  2. Make sure you are not loading a locally edited copy of the model anywhere else (e.g. copied folders, cached configs).

  3. Recreate your pipeline from a clean base:

    from diffusers import QwenImageEditPlusPipeline
    import torch
    
    base_model = "autodl-tmp/qwen-image-edit-2509"  # or the original HF ID
    
    pipe = QwenImageEditPlusPipeline.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,  # or float16 depending on GPU
    )
    pipe.load_lora_weights(
        "qwen-image-finetune/output/all/qwen_image_finetune/v1/checkpoint-last-0-120-last"
    )
    pipe.enable_model_cpu_offload()
    
  4. If your LoRA was trained after you changed patch_size, it was trained on a broken config.
    Re-run training with the restored patch_size=14 so that the LoRA sees a consistent architecture.

  5. Then run inference (no changes besides the reverted config):

    images_out = pipe(
        image,
        prompt,
        negative_prompt="",
        num_inference_steps=15,
        output_type="pil",
        true_cfg_scale=4.0,
    ).images
    
  6. Optional but recommended: use torch.float16 on GPUs without native bfloat16 (e.g. T4) to avoid dtype issues unrelated to this bug.

This is the only fix that is fully compatible with the released checkpoint and the way Qwen2.5-VL is documented to work.
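As a quick sanity check after reverting (this assumes, as the traceback above shows, that the pipeline exposes its Qwen2.5-VL processor as pipe.processor):

ip = pipe.processor.image_processor
print(ip.patch_size, ip.temporal_patch_size, ip.merge_size)   # expect 14, 2, 2
assert ip.patch_size == 14, "a locally edited preprocessor_config.json is still being picked up"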


5.2. How to get “more detail” without changing patch_size

If your goal was to “increase detail” by using smaller patches, the supported way with Qwen2-VL / Qwen2.5-VL is to change the resolution range, not the patch size.

From the official Qwen2.5-VL docs:

  • Images are resized so that they produce between 256–1024 tokens per image.
  • The 28 factor comes from patch_size=14 and merge_size=2 (14×2=28), i.e. each merged visual token covers a 28×28 pixel area. (Hugging Face)

You can adjust the resolution like this:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
ip = processor.image_processor

# Example: allow larger images (more tokens)
ip.min_pixels = 256 * 28 * 28   # lower bound
ip.max_pixels = 2048 * 28 * 28  # upper bound, more than default 1024 tokens

processor.save_pretrained("my-qwen2_5-vl-processor")

For Qwen-Image-Edit, the same idea applies:

  • Keep patch_size=14.
  • Adjust min_pixels / max_pixels (or the actual height / width you feed into the pipeline) to control how many visual tokens each image yields.

This gives you more or fewer tokens per image while preserving internal consistency.
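If you prefer to apply this directly to the Qwen-Image-Edit pipeline instead of a standalone processor, something along these lines should work (again assuming the pipeline's Qwen2.5-VL processor is reachable as pipe.processor, and that your GPU has room for the extra tokens):

ip = pipe.processor.image_processor
ip.min_pixels = 256 * 28 * 28     # lower bound on visual tokens per condition image
ip.max_pixels = 2048 * 28 * 28    # upper bound, mirroring the example above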


5.3. If you truly want patch_size=7 (research-only, not plug-and-play)

To really use patch_size=7 safely, you would need:

  1. A vision encoder whose architecture is built for patch_size=7:

    • Different conv kernel/stride in the patch embed.
    • Appropriate changes to spatial_merge_unit, RoPE indexing, and window layout.
    • A new or heavily re-trained set of weights.
  2. A processor whose patch_size, temporal_patch_size, merge_size all match that new vision encoder.

  3. If you also change patch size in the diffusion DiT, its proj_out layer must be rebuilt with:

    nn.Linear(inner_dim, patch_size * patch_size * out_channels, bias=True)
    

    which changes dimensionality when patch_size changes. (Hugging Face)

  4. Then you would have to:

    • Train/fine-tune this new vision+dense stack.
    • Re-attach it to the Qwen-Image-Edit pipeline.
    • Only then train LoRA on top.

There is currently no public “Qwen-Image-Edit-2509 with patch_size=7” model or official guide. Doing this is a research project, not a config tweak.
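To make concrete how deep the change goes, here is roughly what just the patch-embedding layer would have to become. This is a purely illustrative sketch of a Qwen2.5-VL-style patch embed, not the actual class, and its weights would not match any released checkpoint:

import torch.nn as nn

class PatchEmbed7(nn.Module):
    # Illustrative 7x7 patch embed; every downstream shape assumption changes with it.
    def __init__(self, in_channels=3, temporal_patch_size=2, patch_size=7, embed_dim=1280):
        super().__init__()
        kernel = (temporal_patch_size, patch_size, patch_size)
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=kernel, stride=kernel, bias=False)

    def forward(self, pixel_values):
        # pixel_values arrive pre-flattened, one row per patch
        x = pixel_values.view(-1, 3, 2, 7, 7)                 # (num_patches, C, T, H, W)
        return self.proj(x).view(-1, self.proj.out_channels)

And that is only the first layer; the RoPE tables, window partitioning, and the 2×2 merger all make the same 14-pixel assumption and would need matching changes plus re-training.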


5.4. Extra debugging tips (if you want to inspect the mismatch)

If you want to see the shapes that are going wrong:

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

model_id = "Qwen/Qwen-Image-Edit-2509"  # original, unmodified pipeline repo

# The processor and text encoder live in subfolders of the diffusers pipeline repo
processor = AutoProcessor.from_pretrained(model_id, subfolder="processor")
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16
)

image = Image.open("some_image.png").convert("RGB")

# The text must contain the image placeholder so the processor can expand it
batch = processor(
    text=["<|vision_start|><|image_pad|><|vision_end|>test"],
    images=image,
    padding=True,
    return_tensors="pt",
)

print("pixel_values:", batch.pixel_values.shape)
print("image_grid_thw:", batch.image_grid_thw)

with torch.no_grad():
    outputs = text_encoder(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_mask,
        pixel_values=batch.pixel_values,
        image_grid_thw=batch.image_grid_thw,
        output_hidden_states=True,
    )

With the official config, you will see that:

  • image_grid_thw values are consistent with patch_size=14, temporal_patch_size=2, merge_size=2. (Hugging Face)

If you then change patch_size=7 again, you can print the same shapes and you’ll see that grid_thw and the token counts no longer match the assumptions built into the vision transformer.

For low-level debugging of the CUDA assert itself, you can temporarily set:

export CUDA_LAUNCH_BLOCKING=1

to make the stack trace more synchronous, but this will only confirm what we already know: the failure happens at the gather into rotary_pos_emb using window_index.


6. Summary (key points)

  • Changing patch_size from 14 to 7 in processor/preprocessor_config.json breaks the contract between the image processor and the Qwen2.5-VL vision encoder used inside Qwen-Image-Edit-2509. (Hugging Face)
  • The processor now produces image_grid_thw for 7×7 patches, while the vision encoder is still built and trained for 14×14 patches and 2×2 merges.
  • This mismatch causes the computed window_index to contain indices that are out of bounds for the actual sequence length, triggering the CUDA vectorized_gather_kernel assertion at rotary_pos_emb = rotary_pos_emb[window_index, :, :]. (GitHub)
  • Training can appear to run because not all code paths / resolutions are exercised there, but the setup is architecturally inconsistent.
  • The only safe fix with current public checkpoints is to restore patch_size=14, re-load a clean model, and train your LoRA with that configuration. If you need more detail, adjust image resolution (min_pixels / max_pixels, height/width), not patch_size. (Hugging Face)

Suggested references (high-signal reading)

  • Qwen-Image GitHub issue about changing patch_size from 14 → 7 and hitting exactly your error. (GitHub)
  • Hugging Face Transformers docs for Qwen2-VL / Qwen2.5-VL processors (explains patch_size, temporal_patch_size, merge_size, pixel-based dynamic resolution). (Hugging Face)
  • Qwen2-VL / Qwen2.5-VL architectural notes and code walkthroughs explaining how 14×14 patches and 2×2 merges give ~28×28 effective visual tokens. (arXiv)
  • Qwen3-VL repo and qwen-vl-utils docs (shows model-specific image_patch_size: 14 for Qwen2.5-VL, 16 for Qwen3-VL). (GitHub)
  • Qwen-Image-Edit code snippets showing diffusion proj_out depends on patch_size * patch_size * out_channels, confirming patch size is baked into the model, not purely a preprocessor knob. (Hugging Face)