Wrong tensor shape when using a model: TypeError: Cannot handle this data type: (1, 1, 1280, 3), |u1

Hi,

VideoMAE models are trained with a fixed number of frames, which you can check via model.config.num_frames. This is typically 16 or 32. In other words, you'll need to sample 16 or 32 frames from the video and provide those to the model. See the sketch below for one way to sample frames.
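A minimal sketch, assuming decord for video decoding and the MCG-NJU/videomae-base-finetuned-kinetics checkpoint purely as an illustration (the video path is a placeholder); the image processor then handles resizing, normalization, and channel ordering:

```python
import numpy as np
from decord import VideoReader, cpu
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

model_name = "MCG-NJU/videomae-base-finetuned-kinetics"  # example checkpoint
processor = VideoMAEImageProcessor.from_pretrained(model_name)
model = VideoMAEForVideoClassification.from_pretrained(model_name)

num_frames = model.config.num_frames  # typically 16 or 32

# Sample `num_frames` evenly spaced frames from the video.
vr = VideoReader("video.mp4", ctx=cpu(0))  # placeholder path
indices = np.linspace(0, len(vr) - 1, num=num_frames).astype(int)
frames = list(vr.get_batch(indices).asnumpy())  # list of (H, W, 3) uint8 arrays

inputs = processor(frames, return_tensors="pt")  # pixel_values: (1, num_frames, 3, H, W)
outputs = model(**inputs)
```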

Also, the channels come first rather than last: pixel_values should have shape (batch_size, num_frames, num_channels, height, width). This is documented here: VideoMAE.
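If you build pixel_values yourself instead of going through the image processor, a minimal sketch of the expected channels-first layout (assuming a 224x224 checkpoint with 16 frames) looks like this:

```python
import numpy as np
import torch

frame_hwc = np.zeros((224, 224, 3), dtype=np.uint8)       # (H, W, C), as read from a video file
frame_chw = torch.from_numpy(frame_hwc).permute(2, 0, 1)   # (C, H, W), channels first

# Stack 16 such frames and add a batch dimension.
pixel_values = torch.stack([frame_chw] * 16).unsqueeze(0).float()
print(pixel_values.shape)  # torch.Size([1, 16, 3, 224, 224])
```

The error in the title typically means the frames were left in (height, width, channels) order, so permuting to channels first (or letting the processor do it) resolves it.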
