Bug Description
Hi all - when using Sam3VideoModel with text prompts, the model crashes because CLIPTextModelOutput objects are passed where tensors are expected, causing:
AttributeError: 'CLIPTextModelOutput' object has no attribute 'shape'
at transformers/masking_utils.py:912 in create_bidirectional_mask().
Root Cause
The SAM3Video pipeline stores full CLIPTextModelOutput objects in inference_session.prompt_embeddings, but Sam3Model.forward() expects a raw tensor for its text_embeds parameter. The CLIP output's pooler_output tensor therefore needs to be extracted before it is passed to the model.
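To make the mismatch concrete, here is a minimal, dependency-free sketch of the unwrap logic. FakeTextOutput and unwrap_text_embeds are hypothetical names standing in for the real CLIPTextModelOutput and the extraction the fix below performs; the duck-typed hasattr check is the same idea.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeTextOutput:
    # Stand-in for transformers' CLIPTextModelOutput (illustration only)
    pooler_output: Any
    last_hidden_state: Any = None


def unwrap_text_embeds(text_embeds):
    """Return the pooled value whether given a raw tensor or an output object."""
    if text_embeds is not None and hasattr(text_embeds, "pooler_output"):
        return text_embeds.pooler_output
    return text_embeds


wrapped = FakeTextOutput(pooler_output=[[0.1, 0.2]])
print(unwrap_text_embeds(wrapped))          # inner pooled value is extracted
print(unwrap_text_embeds([[0.1, 0.2]]))     # raw input passes through unchanged
```

A raw tensor has no pooler_output attribute, so it passes through untouched, which keeps the fix backward compatible with callers that already pass tensors.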
Reproduction
from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import torch
model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda", dtype=torch.bfloat16)
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
video_frames, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4")
inference_session = processor.init_video_session(
    video=video_frames,
    inference_device="cuda",
    processing_device="cpu",
    video_storage_device="cpu",
    dtype=torch.bfloat16,
)
inference_session = processor.add_text_prompt(
    inference_session=inference_session,
    text="person",
)
# Crashes here
for model_outputs in model.propagate_in_video_iterator(inference_session=inference_session):
break
Proposed Fix
In transformers/models/sam3/modeling_sam3.py, add this to the beginning of Sam3Model.forward():
def forward(self, pixel_values=None, vision_embeds=None, input_ids=None,
            attention_mask=None, text_embeds=None, input_boxes=None,
            input_boxes_labels=None, **kwargs):
    # Extract tensor from CLIPTextModelOutput if needed
    if text_embeds is not None and hasattr(text_embeds, 'pooler_output'):
        text_embeds = text_embeds.pooler_output
    # ... rest of method
Environment
- transformers: 5.0.0
- torch: 2.4.1+cu121
- python: 3.10
Workaround
import transformers.models.sam3.modeling_sam3 as sam3_module

original_forward = sam3_module.Sam3Model.forward

def patched_forward(self, pixel_values=None, vision_embeds=None, input_ids=None,
                    attention_mask=None, text_embeds=None, input_boxes=None,
                    input_boxes_labels=None, **kwargs):
    if text_embeds is not None and hasattr(text_embeds, 'pooler_output'):
        text_embeds = text_embeds.pooler_output
    return original_forward(self, pixel_values, vision_embeds, input_ids,
                            attention_mask, text_embeds, input_boxes,
                            input_boxes_labels, **kwargs)

sam3_module.Sam3Model.forward = patched_forward
# Then use Sam3VideoModel normally
Didn’t want to open a bug on GitHub before checking whether I’m missing something obvious!