What is the correct feature shape for images when extracting features with Swin Transformer?

Hi! I am new to the Hugging Face transformers library.
I have run into a problem and hope someone can help me out.

What I am trying to do: I am building a gender classifier from 5k RGB images of size 32x32.
I am using SwinForImageClassification. I was able to train it and get around 80% accuracy.
Now I am trying to get only the image features. I tried using SwinModel to extract them (after reading this: Using transformer to get image features).

I am getting a feature shape of [494, 49, 768] on a training set of 3952 images.
According to the example found here (Swin Transformer), the shape looks correct to me.

The problem I am facing: my supervisor says that for 5k images the feature shape should be [5000, 1024] for the Swin Base model.
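
To make the two shapes easier to compare, here is a minimal sketch of the shape arithmetic as I understand it. It assumes the usual Swin layout (224x224 input, 4x4 patches, four stages, with each patch-merging step halving the side length and doubling the channel width). The function name swin_output_shape is my own, not part of any library. Under these assumptions, 768 corresponds to an embedding dimension of 96 and 1024 to an embedding dimension of 128.

```python
def swin_output_shape(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (num_tokens, hidden_size) for the last Swin stage.

    Hypothetical helper for illustration only; the defaults are assumed
    values for a standard 224x224 Swin configuration.
    """
    # Patch embedding: the image becomes a grid of patch_size x patch_size patches.
    side = image_size // patch_size
    # Each of the (num_stages - 1) patch-merging steps halves the grid side...
    side //= 2 ** (num_stages - 1)
    # ...and doubles the channel width.
    hidden_size = embed_dim * 2 ** (num_stages - 1)
    return side * side, hidden_size


tiny = swin_output_shape(embed_dim=96)   # embedding dimension assumed for the tiny/small variants
base = swin_output_shape(embed_dim=128)  # embedding dimension assumed for the base variant
print(tiny, base)
```

So a last dimension of 768 would suggest that the checkpoint behind my features is not a Base-sized one, and the 49 is the number of tokens per image rather than anything related to the number of images.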

How do I achieve this? Any suggestions?
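
For context, this is a minimal sketch of the kind of extraction I think would give one vector per image. It is not my actual training code. It assumes a Base-sized configuration (embed_dim=128) built with SwinConfig and randomly initialised weights so that it runs without downloading anything; in practice the model would be loaded from a pretrained checkpoint instead. The random tensors stand in for preprocessed image batches. It uses pooler_output, which averages over the 49 tokens, and concatenates the batches along the first dimension.

```python
import torch
from transformers import SwinConfig, SwinModel

# Assumed Base-sized hyperparameters; weights are random, for shape checking only.
config = SwinConfig(
    image_size=224,
    patch_size=4,
    num_channels=3,
    embed_dim=128,
    depths=[2, 2, 18, 2],
    num_heads=[4, 8, 16, 32],
    window_size=7,
)
model = SwinModel(config).eval()

# Stand-in data: 2 batches of 4 already-preprocessed RGB images.
batches = [torch.randn(4, 3, 224, 224) for _ in range(2)]

pooled_batches = []
with torch.no_grad():
    for pixel_values in batches:
        outputs = model(pixel_values=pixel_values)
        # Per-token features: (batch, 49, 1024)
        token_features = outputs.last_hidden_state
        # One vector per image (average over tokens): (batch, 1024)
        pooled_batches.append(outputs.pooler_output)

# Concatenate along the image dimension: (num_images, 1024)
features = torch.cat(pooled_batches, dim=0)
print(token_features.shape)
print(features.shape)
```

If this is the right idea, then running it over all 5k images with a Base checkpoint should give [5000, 1024], but I would appreciate confirmation.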