How to get a video embedding from a pretrained transformer?

I would like to use a video transformer (like VideoMAE) to get an embedding of a whole video (the equivalent of the CLS token). Using the demo from Hugging Face, I get:
outputs.last_hidden_state.shape = torch.Size([1, 1568, 768])
I thought the first token was the CLS token, but 1567 is a prime number and therefore cannot correspond to the patch embeddings of the video (which should factor into a temporal × height × width grid).
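For reference, here is the arithmetic I tried, assuming the default VideoMAE config (16 frames, tubelet size 2, 224×224 input, 16×16 patches — these numbers are my assumption, not read from the checkpoint):

```python
# Token count expected from VideoMAE's patch embedding,
# assuming the default config (my assumption, not verified
# against the checkpoint): 16 frames, tubelet size 2,
# 224x224 frames, 16x16 spatial patches.
num_frames = 16
tubelet_size = 2  # frames grouped into one temporal "tube"
image_size = 224
patch_size = 16

temporal_tokens = num_frames // tubelet_size      # 16 / 2 = 8
spatial_tokens = (image_size // patch_size) ** 2  # 14 * 14 = 196
total_tokens = temporal_tokens * spatial_tokens   # 8 * 196 = 1568

print(total_tokens)  # 1568
```

So the full sequence length of 1568 already factors cleanly as 8 × 14 × 14, leaving no room for an extra CLS token, which is what makes me suspect I'm misreading the output.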
Can someone help me?