I have created a Vision encoder-decoder model (Swin/BERT) for image-to-text extraction. But instead of training the model to predict the ground-truth text, I want it to classify the already-extracted texts. Is this possible, and have I explained it well enough?