SAM3Video: CLIPTextModelOutput passed as tensor causes crash with text prompts

Bug Description

Hi all - when using Sam3VideoModel with text prompts, the model crashes because a CLIPTextModelOutput object is passed where a tensor is expected, causing:

AttributeError: 'CLIPTextModelOutput' object has no attribute 'shape'

at transformers/masking_utils.py:912 in create_bidirectional_mask().

Root Cause

The SAM3Video pipeline stores full CLIPTextModelOutput objects in inference_session.prompt_embeddings, but Sam3Model.forward() expects a raw tensor for the text_embeds parameter. The CLIP output’s pooler_output attribute needs to be extracted before being passed to the model.
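A minimal sketch of the mismatch, using a stand-in dataclass instead of the real CLIPTextModelOutput so it runs without transformers installed (FakeCLIPTextModelOutput and unwrap_text_embeds are hypothetical names for illustration, not part of the library):

```python
from dataclasses import dataclass

@dataclass
class FakeCLIPTextModelOutput:
    # The real output also carries last_hidden_state etc.;
    # only pooler_output matters for this bug.
    pooler_output: list

def unwrap_text_embeds(text_embeds):
    """Return the raw embedding whether we were handed the
    model-output wrapper or the tensor itself."""
    if text_embeds is not None and hasattr(text_embeds, "pooler_output"):
        return text_embeds.pooler_output
    return text_embeds

wrapped = FakeCLIPTextModelOutput(pooler_output=[0.1, 0.2])
assert unwrap_text_embeds(wrapped) == [0.1, 0.2]  # wrapper unwrapped
assert unwrap_text_embeds([0.3]) == [0.3]         # plain tensor passes through
```

The hasattr check means already-correct callers that pass a plain tensor are unaffected.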

Reproduction

from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import torch

model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda", dtype=torch.bfloat16)
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")

video_frames, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4")

inference_session = processor.init_video_session(
    video=video_frames,
    inference_device="cuda",
    processing_device="cpu",
    video_storage_device="cpu",
    dtype=torch.bfloat16,
)

inference_session = processor.add_text_prompt(
    inference_session=inference_session,
    text="person",
)

# Crashes here
for model_outputs in model.propagate_in_video_iterator(inference_session=inference_session):
    break

Proposed Fix

In transformers/models/sam3/modeling_sam3.py, add this to the beginning of Sam3Model.forward():

def forward(self, pixel_values=None, vision_embeds=None, input_ids=None, 
            attention_mask=None, text_embeds=None, input_boxes=None, 
            input_boxes_labels=None, **kwargs):
    
    # Extract tensor from CLIPTextModelOutput if needed
    if text_embeds is not None and hasattr(text_embeds, 'pooler_output'):
        text_embeds = text_embeds.pooler_output
    
    # ... rest of method

Environment

  • transformers: 5.0.0
  • torch: 2.4.1+cu121
  • python: 3.10

Workaround

import transformers.models.sam3.modeling_sam3 as sam3_module

original_forward = sam3_module.Sam3Model.forward

def patched_forward(self, pixel_values=None, vision_embeds=None, input_ids=None,
                    attention_mask=None, text_embeds=None, input_boxes=None, 
                    input_boxes_labels=None, **kwargs):
    if text_embeds is not None and hasattr(text_embeds, 'pooler_output'):
        text_embeds = text_embeds.pooler_output
    return original_forward(self, pixel_values, vision_embeds, input_ids, 
                          attention_mask, text_embeds, input_boxes, 
                          input_boxes_labels, **kwargs)

sam3_module.Sam3Model.forward = patched_forward

# Then use Sam3VideoModel normally
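If you'd rather not leave Sam3Model permanently patched, the same idea can be wrapped in a context manager that restores the original forward on exit. Sketched here on a toy class so it runs standalone; in practice you would pass sam3_module.Sam3Model instead of Toy (Toy, patched, and Wrapped are illustrative names, not library APIs):

```python
from contextlib import contextmanager

class Toy:
    # Stand-in for Sam3Model: echoes text_embeds back.
    def forward(self, text_embeds=None):
        return text_embeds

@contextmanager
def patched(cls):
    """Temporarily wrap cls.forward to unwrap pooler_output, then restore."""
    original = cls.forward
    def wrapper(self, *args, text_embeds=None, **kwargs):
        if text_embeds is not None and hasattr(text_embeds, "pooler_output"):
            text_embeds = text_embeds.pooler_output
        return original(self, *args, text_embeds=text_embeds, **kwargs)
    cls.forward = wrapper
    try:
        yield
    finally:
        cls.forward = original  # undo the patch even on error

class Wrapped:
    pooler_output = "tensor"

with patched(Toy):
    assert Toy().forward(text_embeds=Wrapped()) == "tensor"  # unwrapped
assert Toy().forward(text_embeds=Wrapped()) != "tensor"      # original restored
```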

Didn’t want to open a bug on GitHub before checking whether I’m missing something obvious!
