Issue with KOSMOS-2 encoding and decoding

ydshieh · January 24, 2024, 1:59pm

If I understand correctly, the attached image contains the bounding boxes obtained via the processing done by Kosmos2Processor:

processed_text, entities = processor.post_process_generation(test_decode)

And you want to demonstrate with this image that these bounding boxes do not match the original input bounding boxes (the ones you specified via set_box and then passed through normalized_box(convert_box)). Is this correct?

I can see your image width/height is set to 1025 (I didn’t verify this, however). Kosmos2Processor splits the image into a 32 x 32 grid. With the original input size of 224, each cell is of size 7x7. With your image size of 1025, however, each cell will be about 32x32, which I believe is quite large for document AI.
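To make the arithmetic above concrete, here is a minimal sketch of the cell size for a 32 x 32 grid and of how far a coordinate can move when it is snapped to a grid cell. The helper names `cell_size` and `quantize_roundtrip` are hypothetical, and snapping to the cell centre is an assumption made for illustration; this is not the actual Kosmos2Processor code.

```python
def cell_size(image_size: int, grid: int = 32) -> float:
    """Side length, in pixels, of one grid cell for a square image."""
    return image_size / grid


def quantize_roundtrip(coord: float, image_size: int, grid: int = 32) -> float:
    """Snap a pixel coordinate to the centre of its grid cell.

    Illustrative only: the real processor may map bins back to
    coordinates differently (assumption).
    """
    size = cell_size(image_size, grid)
    # Clamp so that coord == image_size still falls in the last cell.
    index = min(int(coord // size), grid - 1)
    return (index + 0.5) * size


if __name__ == "__main__":
    for image_size in (224, 1025):
        size = cell_size(image_size)
        snapped = quantize_roundtrip(100.0, image_size)
        print(f"image={image_size}: cell={size:.2f}px, 100.0 -> {snapped:.2f}")
```

Under this assumption, a single coordinate can shift by up to half a cell, so the error budget grows in proportion to the image size.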

To verify this, you can compare the original input bboxes against the final processed/computed output bboxes and see whether their differences are within a range of 32 (or even 64). If the differences are all inside this range, I wouldn’t say there is something wrong in the code of Kosmos2Processor. In that case, it’s just a limitation of this kind of processing.

If the differences are larger than 32 or 64, something is likely to be wrong and I can take a closer look.
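A rough sketch of that comparison, assuming both sets of boxes are already in pixel coordinates as (x1, y1, x2, y2) and listed in the same order. The helper names `max_box_deviation` and `within_grid_tolerance` are hypothetical, not part of transformers.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def max_box_deviation(original: Sequence[Box], decoded: Sequence[Box]) -> float:
    """Largest absolute per-coordinate difference between paired boxes."""
    if len(original) != len(decoded):
        raise ValueError("box lists must have the same length")
    return max(
        (abs(o - d) for ob, db in zip(original, decoded) for o, d in zip(ob, db)),
        default=0.0,
    )


def within_grid_tolerance(
    original: Sequence[Box],
    decoded: Sequence[Box],
    image_size: int,
    grid: int = 32,
    cells: int = 1,
) -> bool:
    """True if every coordinate moved by at most `cells` grid cells."""
    tolerance = cells * image_size / grid
    return max_box_deviation(original, decoded) <= tolerance


if __name__ == "__main__":
    # Made-up boxes, purely for illustration.
    original = [(100.0, 100.0, 300.0, 300.0)]
    decoded = [(112.0, 96.0, 320.0, 310.0)]
    print(max_box_deviation(original, decoded))
    print(within_grid_tolerance(original, decoded, image_size=1025))
```

Passing `cells=2` corresponds to the looser threshold of 64 for an image size of 1025.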

If you don’t mind training a model from scratch, there are some arguments that could be changed to modify the default properties of Kosmos2Processor. I can share more information if you are down for this.
