Wrong tensor shape when using a model: TypeError: Cannot handle this data type: (1, 1, 1280, 3), |u1

Hey everyone, I'm trying to use a VideoMAE model, but I get a type error: TypeError: Cannot handle this data type: (1, 1, 1280, 3), |u1
My (relevant) code is:

import torch
import torchvision as tv
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

def predict_video_class(video):
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    inputs = processor(video, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    predicted_class_idx = logits.argmax(-1).item()
    return f"Predicted class: {model.config.id2label[predicted_class_idx]}"

# read_video returns the frames as a [T, H, W, C] uint8 tensor
video, audio, metadata = tv.io.read_video(video_path, pts_unit='sec')
print(predict_video_class(video))

The video file is a ~30-second MP4.

I’ve tried printing the model to see which shape I need, but it isn’t clear to me from the output:

VideoMAEForVideoClassification(
  (videomae): VideoMAEModel(
    (embeddings): VideoMAEEmbeddings(
      (patch_embeddings): VideoMAEPatchEmbeddings(
        (projection): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
    )
    (encoder): VideoMAEEncoder(
      (layer): ModuleList(
        (0-23): 24 x VideoMAELayer(
          (attention): VideoMAEAttention(
            (attention): VideoMAESelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=False)
              (key): Linear(in_features=1024, out_features=1024, bias=False)
              (value): Linear(in_features=1024, out_features=1024, bias=False)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): VideoMAESelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): VideoMAEIntermediate(
            (dense): Linear(in_features=1024, out_features=4096, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): VideoMAEOutput(
            (dense): Linear(in_features=4096, out_features=1024, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (layernorm_before): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (layernorm_after): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
        )
      )
      ... (output truncated)

The current shape of my tensor is [844, 640, 1280, 3] in format [T, H, W, C] (time, height, width, channels).

What shape should my tensor be in? How can I figure this out for myself in the future? What does the |u1 signify?

Hi,

VideoMAE models are trained on a fixed number of frames, which you can read from model.config.num_frames; this is typically 16 or 32. In other words, you'll need to sample 16 or 32 frames from the video and provide those to the model. See the example code snippet for details on sampling frames, and the sketch below.
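For illustration, here's a minimal sketch of uniform frame sampling, assuming the video tensor from tv.io.read_video in your code above and a 16-frame checkpoint (read the real count from model.config.num_frames):

num_frames = 16  # assumption: check model.config.num_frames for the real value
# Pick num_frames evenly spaced frame indices across the whole clip.
indices = torch.linspace(0, video.shape[0] - 1, steps=num_frames).long()
frames = video[indices]  # [16, H, W, C], still uint8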

Also, the model expects the channels first rather than last: pixel_values should have shape (batch_size, num_frames, num_channels, height, width). This is documented here: VideoMAE. As for the |u1, that's just NumPy's dtype string for uint8 (unsigned 8-bit integers); the error itself comes from PIL, which couldn't interpret an array of that shape as an image.
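Putting both together, a hedged end-to-end sketch (the permute matches the documented channels-first layout, and the 224×224 output size is the processor's default; indices comes from the snippet above):

frames = video[indices].permute(0, 3, 1, 2)  # [T, H, W, C] -> [T, C, H, W]
inputs = processor(list(frames), return_tensors="pt")
# inputs["pixel_values"].shape is now [1, 16, 3, 224, 224]

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])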


Thank you very much!
