Use only the encoder to generate image embeddings in a VisionEncoderDecoderModel such as Donut

I want to run a pre-trained Donut checkpoint on my document images to generate an embedding for each image, which I can then use in a subsequent pipeline. I don't need the decoder output, only the image embeddings that are fed into the decoder to generate the text completion. Any ideas or code samples on how to achieve this?
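One way to sketch this, assuming the `transformers` library and the public `naver-clova-ix/donut-base` checkpoint (substitute your own fine-tuned checkpoint): a `VisionEncoderDecoderModel` exposes its Swin vision encoder as `model.encoder`, so you can call it directly on the processed pixel values and never touch the decoder. The blank `Image.new` input is a stand-in for a real document image, and the mean-pooling step is just one possible way to collapse per-patch features into a single vector per image.

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Public base checkpoint; replace with your own fine-tuned Donut model.
checkpoint = "naver-clova-ix/donut-base"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
model.eval()

# Stand-in image; replace with Image.open("your_document.png").convert("RGB").
image = Image.new("RGB", (1200, 900), "white")
pixel_values = processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Run only the vision (Swin) encoder; the text decoder is never invoked.
    encoder_outputs = model.encoder(pixel_values=pixel_values)

# Per-patch embeddings: shape (batch, num_patches, hidden_size).
embeddings = encoder_outputs.last_hidden_state

# One option for a single vector per image: mean-pool over the patch axis.
image_embedding = embeddings.mean(dim=1)
print(embeddings.shape, image_embedding.shape)
```

`embeddings` is exactly what the decoder would cross-attend to during text generation, so it should carry the visual information you want for a downstream pipeline; whether to mean-pool or keep the full patch sequence depends on what that pipeline expects.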


I am also looking for an answer to this; please let me know if you find a solution.