Is it possible to use Vision Encoder Decoder model for extracting text in document and then classifying the extracted texts

I have created Vision encoder-decoder model (swin/bert) for image-to-text extraction. But instead of telling the model to predict the ground truth, I want it to classify the already-founded texts. Is this possible and do I explain it well enough?