Hey everyone, I’m trying to run a VideoMAE model on a video, but I get a type error: TypeError: Cannot handle this data type: (1, 1, 1280, 3), |u1
My (relevant) code is:
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import torch
import torchvision as tv
def predict_video_class(video):
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    inputs = processor(video, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    return f"Predicted class: {model.config.id2label[predicted_class_idx]}"
# video_path is the path to the mp4 file (set elsewhere in my script)
video, audio, metadata = tv.io.read_video(video_path, pts_unit='sec')
print(predict_video_class(video))
The video file is a roughly 30-second MP4.
I’ve tried printing the model to see which shape I need, but it isn’t clear to me from the output:
VideoMAEForVideoClassification(
  (videomae): VideoMAEModel(
    (embeddings): VideoMAEEmbeddings(
      (patch_embeddings): VideoMAEPatchEmbeddings(
        (projection): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
    )
    (encoder): VideoMAEEncoder(
      (layer): ModuleList(
        (0-23): 24 x VideoMAELayer(
          (attention): VideoMAEAttention(
            (attention): VideoMAESelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=False)
              (key): Linear(in_features=1024, out_features=1024, bias=False)
              (value): Linear(in_features=1024, out_features=1024, bias=False)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): VideoMAESelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): VideoMAEIntermediate(
            (dense): Linear(in_features=1024, out_features=4096, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): VideoMAEOutput(
            (dense): Linear(in_features=4096, out_features=1024, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (layernorm_before): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (layernorm_after): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
        )
      )
The current shape of my tensor is [844, 640, 1280, 3], in the format [T, H, W, C] (time, height, width, color channels).
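To make the shape question concrete, here is a rough sketch of how I might cut the 844 frames down to a short fixed-length clip. It assumes (I have not confirmed this) that the model wants 16 frames per clip, which should be readable from model.config.num_frames. sample_frame_indices is a hypothetical helper of my own, not something from transformers or torchvision.

```python
def sample_frame_indices(total_frames: int, clip_len: int) -> list[int]:
    """Return clip_len roughly evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or clip_len <= 0:
        raise ValueError("total_frames and clip_len must be positive")
    if clip_len == 1:
        return [0]
    # Spread the indices from the first frame to the last frame.
    step = (total_frames - 1) / (clip_len - 1)
    return [round(i * step) for i in range(clip_len)]


# 844 is the T dimension of my tensor; 16 is an assumed clip length.
indices = sample_frame_indices(total_frames=844, clip_len=16)
print(len(indices), indices[0], indices[-1])
```

If the processor accepts a list of per-frame [H, W, C] arrays, as I understand from the VideoMAE docs, the call would then look something like processor([video[i].numpy() for i in indices], return_tensors="pt") instead of passing the whole 4D tensor.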
What shape should my tensor be in? How can I figure this out for myself in the future? What does the |u1 signify?
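On the |u1 part, my best guess is that it is a NumPy dtype string, where | means byte order does not apply, u means unsigned integer, and 1 is the size in bytes. A minimal check, assuming NumPy is installed (it is not imported in my code above):

```python
import numpy as np

# Parse the string from the error message as a NumPy dtype.
dt = np.dtype("|u1")

print(dt.name)      # human-readable dtype name
print(dt.kind)      # 'u' means unsigned integer
print(dt.itemsize)  # size of one element in bytes
print(dt == np.uint8)
```

If that reading is right, the second half of the error message is only reporting that the array holds 8-bit unsigned pixel values, and the (1, 1, 1280, 3) shape is the part that could not be handled.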