TrOCR fine-tuning

Hello, I fine-tuned microsoft/trocr-small-stage1 on my dataset. When I test the model on a validation sample, inference takes several hours, while microsoft/trocr-small-handwritten copes with it in 10 minutes. What could be the problem, and is it possible to speed up inference somehow?

I am also thinking of training TrOCR for Kannada; I was able to find a BERT model for Kannada on Hugging Face. How do I generate a dataset that can be used for training TrOCR?

For training TrOCR, you just need a dataset of (image, text) pairs.
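A minimal sketch of assembling such pairs, assuming a hypothetical on-disk layout where each line image `name.png` sits next to a transcript file `name.txt` (the layout and function name are illustrative, not a TrOCR requirement):

```python
from pathlib import Path


def collect_pairs(root: str) -> list[tuple[str, str]]:
    """Pair each line image with its ground-truth transcript.

    Assumes a hypothetical layout: for every `name.png` there is a
    `name.txt` holding the text written on that line. Images without
    a transcript are skipped.
    """
    pairs = []
    for img in sorted(Path(root).glob("*.png")):
        txt = img.with_suffix(".txt")
        if txt.exists():
            pairs.append((str(img), txt.read_text(encoding="utf-8").strip()))
    return pairs
```

Each pair can then go through `TrOCRProcessor`: the image becomes `pixel_values` for the encoder, and the tokenized text becomes the `labels` the decoder is trained on.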

To speed up inference, you can either (1) run on a GPU or (2) look into optimizations such as ONNX export, quantization, etc.
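A hedged sketch of option (1), assuming `torch`, `transformers`, and `Pillow` are installed; the checkpoint name comes from this thread, and `line.png` is a hypothetical input file. One plausible cause of the hours-long runs is that a model fine-tuned from the stage1 checkpoint never learned to emit the end-of-sequence token, so capping `max_length` during generation is worth trying as well:

```python
import torch


def pick_device() -> str:
    # Prefer the GPU when one is visible to PyTorch, otherwise fall back to CPU.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    # Sketch only: checkpoint name taken from the thread above.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    device = pick_device()
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-handwritten")
    model.to(device)
    model.eval()

    image = Image.open("line.png").convert("RGB")  # hypothetical input file
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)

    with torch.no_grad():
        # Greedy decoding with a hard length cap keeps generation from
        # looping for hundreds of steps if the model rarely produces EOS.
        ids = model.generate(pixel_values, max_length=64, num_beams=1)
    print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```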

When you say image-text pairs, can I provide an entire image of a page that has multiple lines, together with a .txt file containing the exact same lines in the same order? Does TrOCR support multi-line page inputs?

If not, how do I produce a line-by-line dataset?

TrOCR itself was trained on single-line text images; this was a design choice by Microsoft. They used a text detector to extract individual single-line text images from documents. Detectors you can use include CRAFT or the ones available in DocTR.
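Once a detector (CRAFT, DocTR, etc.) has given you line bounding boxes, turning a page into single-line crops is straightforward. A sketch using Pillow, with the box format and function name as assumptions (note that DocTR reports relative coordinates, so you would scale them to pixels first):

```python
from PIL import Image


def crop_lines(
    page: Image.Image, boxes: list[tuple[int, int, int, int]]
) -> list[Image.Image]:
    """Cut a page image into single-line crops.

    `boxes` are (left, top, right, bottom) pixel rectangles, e.g. produced
    by a text detector. They are sorted top-to-bottom so the crops come
    out in reading order.
    """
    ordered = sorted(boxes, key=lambda b: b[1])
    return [page.crop(b) for b in ordered]
```

Pairing each crop with the matching line of your page-level .txt transcript (same order, one line per crop) then gives you the (image, text) dataset TrOCR expects.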

Of course, nothing stops you from training a VisionEncoderDecoderModel that takes in an entire PDF document and returns all the text appearing in it.