I am experimenting with having a TrOCR model read text from receipts to then have a secondary model summarize important fields like total, items purchased, and store.
I'm using TrOCRProcessor.from_pretrained("microsoft/trocr-large-printed") with code like:
# Dataloader __getitem__
image = Image.open(self.image_paths[idx]).convert('RGB')
image = torch.squeeze(self.processor(image, return_tensors="pt").pixel_values).to('cuda')
# Inference
generated_ids = trocr.model.generate(next(iter(dataloader_train))['image'])
image_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(image_text)
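For context, here is what I believe the tensor shapes are at each stage (a sketch with a random tensor standing in for a real receipt; 384×384 is the resolution I see coming out of the processor, so treat that as my assumption):

```python
import torch

# The processor returns pixel_values of shape (1, 3, 384, 384) for one image;
# squeezing drops the leading batch dim so the DataLoader can stack samples itself.
single = torch.squeeze(torch.rand(1, 3, 384, 384))  # shape: (3, 384, 384)

# A DataLoader with batch_size=4 then yields a batch like:
batch = torch.stack([single] * 4)                   # shape: (4, 3, 384, 384)

print(single.shape, batch.shape)
```

So `next(iter(dataloader_train))['image']` in my inference call is a 4-D batch of these preprocessed images.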
By default, the predictions I get for a batch of receipts are:
['TEL', 'ITEM', 'L', 'TOTAL', 'CASHIER', ':', '3', 'TEL', '1', '1', 'TEL', 'E', 'TEL', 'R']
It seems like the number of tokens being predicted is substantially restricted. The generated_ids for one image in the batch are only 5 tokens long, e.g.
[ 2, 565, 3721, 2, 1]
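If I label those ids individually (assuming the RoBERTa/BART-style special tokens that I believe TrOCR's text decoder reuses, where 0 is `<s>`, 1 is `<pad>`, and 2 is `</s>` -- that mapping is my assumption, not verified against the checkpoint):

```python
# Hypothetical special-token mapping; 565 and 3721 would be the actual text tokens.
special = {0: "<s>", 1: "<pad>", 2: "</s>"}
ids = [2, 565, 3721, 2, 1]
labeled = [special.get(i, f"token_{i}") for i in ids]
print(labeled)  # → ['</s>', 'token_565', 'token_3721', '</s>', '<pad>']
```

If that mapping is right, three of the five ids are special tokens, so the model is really only emitting about two text tokens per image.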
I tried increasing max_new_tokens, which changed the generated_ids values but not their number:
generated_ids = trocr.model.generate(next(iter(dataloader_train))['image'], max_new_tokens=100)
I also tried increasing min_new_tokens, which increases the generated length but yields gibberish:
['LITTLE RULED WITH REPRICE WITH REPRIMULATE WITH REPRICE WITH REPRIMULANTED RETIRED…
Can someone explain how to increase the amount of text that can be predicted from the image, and what the min/max new-token settings actually control, if it is something different?
P.S.: The model initialization prints a warning:
"Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-large-printed and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']".
Should I be dropping those layers or using a different model? I assumed this is already fine-tuned.
P.P.S.: Since receipts have text in very specific locations, it seems like fine-tuning the model to update the attention encoding would be a good idea. I have labels for things like store and items purchased, but not for all of the text in the receipt. Is it reasonable to try to fine-tune the image encoder in this situation?
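To make the question concrete: one alternative I'm considering, in case the model is only meant to read a single line of text at a time, is segmenting the receipt into line crops before running OCR. Here is a rough sketch using a horizontal projection profile on a synthetic receipt image (the thresholds and band sizes are made-up values for illustration):

```python
import numpy as np
from PIL import Image, ImageDraw

# Build a synthetic "receipt": white background with four dark text-like bands.
img = Image.new("L", (200, 300), 255)
draw = ImageDraw.Draw(img)
for top in (20, 80, 140, 200):
    draw.rectangle([10, top, 190, top + 20], fill=0)

arr = np.array(img)
# Count dark pixels per row and threshold to find rows that contain "ink".
ink = (arr < 128).sum(axis=1)
is_text = ink > 5

# Group consecutive text rows into (top, bottom) line boxes.
lines, start = [], None
for y, t in enumerate(is_text):
    if t and start is None:
        start = y
    elif not t and start is not None:
        lines.append((start, y))
        start = None
if start is not None:
    lines.append((start, len(is_text)))

# Each crop would be passed through the processor and generate() separately.
crops = [img.crop((0, top, img.width, bottom)) for top, bottom in lines]
print(len(crops))
```

Would per-line crops like this be a better fit for the model than fine-tuning, given my partial labels?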