Issue with KOSMOS-2 encoding and decoding

Hi @Mit1208

If I understand correctly, the attached image shows the bounding boxes obtained from the post-processing step of Kosmos2Processor:

processed_text, entities = processor.post_process_generation(test_decode)

And you want to show with this image that these bounding boxes do not match the original input boxes (the ones you specified via set_box and then passed through normalized_box(convert_box)). Is this correct?

I can see your image width/height is set to 1025 (I didn’t verify this, however). Kosmos2Processor splits the image into a 32 x 32 grid. With the original input size of 224, each cell is 7x7 pixels. With your image size of 1025, however, each cell is about 32x32 pixels, which I believe is quite coarse for document AI.
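To make the arithmetic concrete, here is a minimal sketch. The function names are hypothetical, and the snapping function only illustrates what quantizing a coordinate to a 32 x 32 grid does in general; it is not taken from Kosmos2Processor's code.

```python
def cell_size(image_size: int, grid: int = 32) -> float:
    """Side length in pixels of one grid cell for a square image."""
    return image_size / grid


def snap_to_grid(coord: float, image_size: int, grid: int = 32) -> float:
    """Snap a pixel coordinate to the lower edge of its grid cell.

    Illustration only: shows how much precision is lost when a
    coordinate is stored as a cell index rather than a pixel value.
    """
    cell = cell_size(image_size, grid)
    index = min(int(coord // cell), grid - 1)  # clamp to the last cell
    return index * cell


print(cell_size(224))   # 7.0
print(cell_size(1025))  # 32.03125
```

With a 1025-pixel image, a coordinate such as 500 snaps to 480.46875 in this sketch, so an offset of up to roughly one cell is expected from quantization alone.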

To verify this, you can compare the original input bboxes against the final processed/computed output bboxes and see whether their differences are within about 32 pixels (or even 64). If the differences are all inside this range, I wouldn’t say there is something wrong in the code of Kosmos2Processor. In that case, it’s just a limitation of this kind of processing.

If the differences are larger than 32 or 64, something is likely wrong and I can take a closer look.
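A minimal sketch of that check, assuming both sets of boxes are already in pixel coordinates as (x1, y1, x2, y2) and listed in the same order. The helper names are hypothetical; if your output boxes are normalized to [0, 1], scale them by the image width/height first.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def max_box_deviation(original: Sequence[Box], recovered: Sequence[Box]) -> float:
    """Largest absolute per-coordinate difference across all box pairs."""
    if len(original) != len(recovered):
        raise ValueError("both lists must contain the same number of boxes")
    return max(
        (abs(a - b) for o, r in zip(original, recovered) for a, b in zip(o, r)),
        default=0.0,
    )


def within_grid_tolerance(
    original: Sequence[Box],
    recovered: Sequence[Box],
    image_size: int = 1025,
    grid: int = 32,
    cells: int = 1,
) -> bool:
    """True if every coordinate is off by at most `cells` grid cells."""
    tolerance = cells * image_size / grid
    return max_box_deviation(original, recovered) <= tolerance
```

Calling `within_grid_tolerance(inputs, outputs)` checks against one cell (about 32 pixels here), and `cells=2` relaxes that to about 64.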

If you don’t mind training a model from scratch, there are some arguments that could be changed to modify the default properties of Kosmos2Processor. I can share more information if you are up for this.
