Wrong tensor shape when using a model: TypeError: Cannot handle this data type: (1, 1, 1280, 3), |u1

Hey everyone, I’m trying to run a video through a model, but I get a type error: TypeError: Cannot handle this data type: (1, 1, 1280, 3), |u1
My (relevant) code is:

import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import torchvision as tv

def predict_video_class(video):
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    inputs = processor(video, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    predicted_class_idx = logits.argmax(-1).item()
    return f"Predicted class: {model.config.id2label[predicted_class_idx]}"

video, audio, metadata = tv.io.read_video(video_path, pts_unit='sec')

The video file is a roughly 30-second MP4.

I’ve tried printing the model to see which shape I need, but it isn’t clear to me from the output:

  (videomae): VideoMAEModel(
    (embeddings): VideoMAEEmbeddings(
      (patch_embeddings): VideoMAEPatchEmbeddings(
        (projection): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
    (encoder): VideoMAEEncoder(
      (layer): ModuleList(
        (0-23): 24 x VideoMAELayer(
          (attention): VideoMAEAttention(
            (attention): VideoMAESelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=False)
              (key): Linear(in_features=1024, out_features=1024, bias=False)
              (value): Linear(in_features=1024, out_features=1024, bias=False)
              (dropout): Dropout(p=0.0, inplace=False)
            (output): VideoMAESelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
          (intermediate): VideoMAEIntermediate(
            (dense): Linear(in_features=1024, out_features=4096, bias=True)
            (intermediate_act_fn): GELUActivation()
          (output): VideoMAEOutput(
            (dense): Linear(in_features=4096, out_features=1024, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          (layernorm_before): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (layernorm_after): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)

The current shape of my tensor is [844, 640, 1280, 3], in [T, H, W, C] format (time, height, width, color).

What shape should my tensor be in? How can I figure this out for myself in the future? What does the |u1 signify?


VideoMAE models are trained with a fixed number of frames, which you can read from model.config.num_frames. This is typically 16 or 32. In other words, instead of passing all 844 frames, you’ll need to sample 16 or 32 frames from the video and provide only those to the model. See the example code snippet for details regarding sampling frames.
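As a rough illustration of the sampling step, here is a minimal sketch that picks evenly spaced frame indices. The helper name sample_frame_indices is hypothetical (it is not part of transformers or torchvision), and uniform spacing is only one possible strategy; the 844 and 16 come from the shape in the question and a typical num_frames value.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 16) -> list[int]:
    """Return `num_frames` indices spread evenly over [0, total_frames - 1].

    Hypothetical helper for illustration; if the video has fewer frames
    than requested, some indices will repeat.
    """
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must both be >= 1")
    if num_frames == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    return [(i * (total_frames - 1)) // (num_frames - 1) for i in range(num_frames)]


# 844 frames in the video from the question, 16 frames assumed for the model.
indices = sample_frame_indices(844, 16)
print(indices)
```

With the tensor returned by tv.io.read_video, indexing as video[indices] should then give a clip of shape [16, 640, 1280, 3]. In practice, pass model.config.num_frames rather than a hard-coded 16.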

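Also, the model expects channels first rather than last: its pixel_values input has shape (batch_size, num_frames, num_channels, height, width), so each frame should be [C, H, W] instead of [H, W, C]. This is documented here: VideoMAE.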
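To make the layout change concrete, here is a small sketch using NumPy with a dummy clip standing in for the sampled frames (the array sizes are made up to keep it light). It also shows where the |u1 in the error message comes from: it is NumPy's type string for uint8, i.e. an unsigned 1-byte integer, where the | means byte order does not apply.

```python
import numpy as np

# Dummy stand-in for 16 sampled frames in [T, H, W, C] layout, uint8 pixels.
clip_thwc = np.zeros((16, 64, 128, 3), dtype=np.uint8)

# Move the channel axis in front of height and width: [T, H, W, C] -> [T, C, H, W].
clip_tchw = clip_thwc.transpose(0, 3, 1, 2)

# Split along the time axis to get a list of [C, H, W] frames.
frames = list(clip_tchw)

print(clip_tchw.shape)      # (16, 3, 64, 128)
print(frames[0].shape)      # (3, 64, 128)
print(clip_thwc.dtype.str)  # |u1
```

For a torch tensor, the equivalent of the transpose is video.permute(0, 3, 1, 2). The resulting list of frames can then be passed to the processor as in the original code, processor(frames, return_tensors="pt").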


Thank you very much!
