Hey everyone, I’m trying to use a model, but I get a type error: TypeError: Cannot handle this data type: (1, 1, 1280, 3), |u1
My (relevant) code is:
import torch
import torchvision as tv
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

def predict_video_class(video):
    # Load the preprocessing pipeline and the fine-tuned classifier.
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    # This is the call that raises the TypeError.
    inputs = processor(video, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    return f"Predicted class: {model.config.id2label[predicted_class_idx]}"

# video_path points to the mp4 file; read_video returns the frames as [T, H, W, C].
video, audio, metadata = tv.io.read_video(video_path, pts_unit='sec')
print(predict_video_class(video))
The video file is a roughly 30-second mp4.
I’ve tried printing the model to see which shape I need, but it isn’t clear to me from the output:
VideoMAEForVideoClassification(
  (videomae): VideoMAEModel(
    (embeddings): VideoMAEEmbeddings(
      (patch_embeddings): VideoMAEPatchEmbeddings(
        (projection): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
    )
    (encoder): VideoMAEEncoder(
      (layer): ModuleList(
        (0-23): 24 x VideoMAELayer(
          (attention): VideoMAEAttention(
            (attention): VideoMAESelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=False)
              (key): Linear(in_features=1024, out_features=1024, bias=False)
              (value): Linear(in_features=1024, out_features=1024, bias=False)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): VideoMAESelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): VideoMAEIntermediate(
            (dense): Linear(in_features=1024, out_features=4096, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): VideoMAEOutput(
            (dense): Linear(in_features=4096, out_features=1024, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (layernorm_before): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (layernorm_after): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
        )
      )
The current shape of my tensor is [844, 640, 1280, 3], in the format [T, H, W, C] (time, height, width, color).
What shape should my tensor be in? How can I figure this out for myself in the future? What does the |u1 signify?
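My best guess so far, as a minimal sketch rather than a confirmed fix: I assume the processor wants a list of individual frames instead of one 4-D tensor, and that this checkpoint expects 16 frames per clip (I believe that is what `model.config.num_frames` reports, but I have not verified it). `sample_frame_indices` is a hypothetical helper name of my own, and the zero array is only a small stand-in for the decoded video. The last line prints the dtype string that NumPy uses for 8-bit unsigned data, which looks like the `|u1` in the error message.

```python
import numpy as np

def sample_frame_indices(num_frames_total, num_frames_wanted=16):
    """Hypothetical helper: evenly spaced frame indices across the whole clip."""
    positions = np.linspace(0, num_frames_total - 1, num=num_frames_wanted)
    return positions.round().astype(int).tolist()

# Stand-in for the decoded video: [T, H, W, C], 8-bit unsigned.
# Height and width are shrunk here only to keep the example cheap.
video = np.zeros((844, 8, 16, 3), dtype=np.uint8)

indices = sample_frame_indices(video.shape[0])
frames = [video[i] for i in indices]  # list of 16 arrays, each [H, W, C]

print(len(frames), frames[0].shape)
print(video.dtype.str)  # NumPy's type string for uint8
```

If that assumption holds, the idea would be to pass `frames` (or `list(video[indices])` for the real tensor) to the processor instead of the full [844, 640, 1280, 3] tensor.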