How to get a fixed-size embedding from the last hidden state of vision models?

I am using vision models (ViT, BEiT) as the image encoder and BERT as the text encoder. I am trying to get a fixed-size 1D representation of the last_hidden_state of BEiT (something like what CLIP obtains) to concatenate with a BERT embedding and feed into an MLP head.

I could use the pooler_output of the image model, but it doesn't seem to preserve certain spatial nuances, hence I would like to use the last_hidden_state.
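For context, here is a minimal sketch of what I have in mind: mean-pooling the patch tokens of last_hidden_state (dropping the [CLS] token) to get a fixed-size image vector, then concatenating with the text embedding. The tensors below are dummies standing in for actual BEiT/BERT outputs, and the shapes assume base-size models with 768 hidden dims and 196 patches:

```python
import torch

# dummy stand-in for BEiT last_hidden_state: (batch, 1 + num_patches, hidden)
last_hidden_state = torch.randn(2, 197, 768)

# mean-pool over patch tokens only (index 0 is the [CLS] token)
image_emb = last_hidden_state[:, 1:, :].mean(dim=1)  # (2, 768)

# dummy stand-in for a BERT [CLS] embedding: (batch, hidden)
text_emb = torch.randn(2, 768)

# concatenate image and text vectors for the MLP head
fused = torch.cat([image_emb, text_emb], dim=-1)  # (2, 1536)
```

Is mean pooling over the patch tokens a reasonable way to do this, or is there a better aggregation?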

@nielsr Please advise.