I am experimenting with having a TrOCR model read text from receipts to then have a secondary model summarize important fields like total, items purchased, and store.
I'm using TrOCRProcessor.from_pretrained("microsoft/trocr-large-printed") with code like:
# Dataloader __getitem__
image = Image.open(self.image_paths[idx]).convert('RGB')
image = torch.squeeze(self.processor(image, return_tensors="pt").pixel_values).to('cuda')
# Inference
generated_ids = trocr.model.generate(next(iter(dataloader_train))['image'])
image_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(image_text)
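For context, here is what I believe the tensor shapes are at each stage (a sketch with a random tensor standing in for a real receipt; 384×384 is the resolution I see coming out of the processor, so treat that as my assumption):

```python
import torch

# The processor returns pixel_values of shape (1, 3, 384, 384) for one image;
# squeezing drops the leading batch dim so the DataLoader can stack samples itself.
single = torch.squeeze(torch.rand(1, 3, 384, 384))  # shape: (3, 384, 384)

# A DataLoader with batch_size=4 then yields a batch like:
batch = torch.stack([single] * 4)                   # shape: (4, 3, 384, 384)

print(single.shape, batch.shape)
```

So `next(iter(dataloader_train))['image']` in my inference call is a 4-D batch of these preprocessed images.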
By default, the predictions I get for a batch of receipts are:
['TEL', 'ITEM', 'L', 'TOTAL', 'CASHIER', ':', '3', 'TEL', '1', '1', 'TEL', 'E', 'TEL', 'R']
It seems like the number of tokens being predicted is substantially restricted. The generated_ids for one image in the batch are only 5 tokens long, e.g.
[ 2, 565, 3721, 2, 1]
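If I label those ids individually (assuming the RoBERTa/BART-style special tokens that I believe TrOCR's text decoder reuses, where 0 is `<s>`, 1 is `<pad>`, and 2 is `</s>` -- that mapping is my assumption, not verified against the checkpoint):

```python
# Hypothetical special-token mapping; 565 and 3721 would be the actual text tokens.
special = {0: "<s>", 1: "<pad>", 2: "</s>"}
ids = [2, 565, 3721, 2, 1]
labeled = [special.get(i, f"token_{i}") for i in ids]
print(labeled)  # → ['</s>', 'token_565', 'token_3721', '</s>', '<pad>']
```

If that mapping is right, three of the five ids are special tokens, so the model is really only emitting about two text tokens per image.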
I tried increasing max_new_tokens, which changed the generated_ids values but not their number:
generated_ids = trocr.model.generate(next(iter(dataloader_train))['image'], max_new_tokens=100)
I also tried increasing min_new_tokens, which increases the generated length but yields gibberish:
['LITTLE RULED WITH REPRICE WITH REPRIMULATE WITH REPRICE WITH REPRIMULANTED RETIRED…
Can someone explain how to increase the amount of text that can be predicted from the image, and what the min/max new-token settings actually control, if it is something different?
P.S.: The model initialization prints a warning:
"Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-large-printed and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']".
Should I be dropping those layers or using a different model? I assumed this is already fine-tuned.
P.P.S.: Since receipts have text in very specific locations, it seems like fine-tuning the model to update the attention encoding would be a good idea. I have labels for things like store and items purchased, but not for all of the text in the receipt. Is it reasonable to try to fine-tune the image encoder in this situation?
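To make the question concrete: one alternative I'm considering, in case the model is only meant to read a single line of text at a time, is segmenting the receipt into line crops before running OCR. Here is a rough sketch using a horizontal projection profile on a synthetic receipt image (the thresholds and band sizes are made-up values for illustration):

```python
import numpy as np
from PIL import Image, ImageDraw

# Build a synthetic "receipt": white background with four dark text-like bands.
img = Image.new("L", (200, 300), 255)
draw = ImageDraw.Draw(img)
for top in (20, 80, 140, 200):
    draw.rectangle([10, top, 190, top + 20], fill=0)

arr = np.array(img)
# Count dark pixels per row and threshold to find rows that contain "ink".
ink = (arr < 128).sum(axis=1)
is_text = ink > 5

# Group consecutive text rows into (top, bottom) line boxes.
lines, start = [], None
for y, t in enumerate(is_text):
    if t and start is None:
        start = y
    elif not t and start is not None:
        lines.append((start, y))
        start = None
if start is not None:
    lines.append((start, len(is_text)))

# Each crop would be passed through the processor and generate() separately.
crops = [img.crop((0, top, img.width, bottom)) for top, bottom in lines]
print(len(crops))
```

Would per-line crops like this be a better fit for the model than fine-tuning, given my partial labels?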