Issue with KOSMOS-2 encoding and decoding

Hi @ydshieh,

First of all, I would like to say thank you for your work on Kosmos-2. I want to use Kosmos-2 for document AI, and I am trying to convert the DocLayNet dataset to embeddings. I read your code and the research paper and found that I first need to convert each bounding box into <patch_index_XXXX> tokens. I tried that using the processor, but when I decode the input_ids, I find a mismatch between the bounding boxes and the image.

# encode the text and normalized bboxes into <patch_index_XXXX> tokens
test = processor(images=[example["image"]], text=[text], bboxes=[float_val])
# decode the input_ids back to text to inspect the patch-index tokens
test_decode = processor.decode(test["input_ids"][0])
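
For reference, here is roughly how the inputs above were set up (the checkpoint name, text, and bbox values are placeholders for my actual DocLayNet data):

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
example = {"image": Image.open("page.png")}  # a DocLayNet page (placeholder)
# one <phrase> tag per bbox; each bbox is (x1, y1, x2, y2) normalized to [0, 1]
text = "<grounding><phrase> a paragraph</phrase>"
float_val = [(0.1, 0.2, 0.9, 0.3)]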

As you can see in the image, the decoded bounding boxes are overlapping and misplaced.

Can you help me figure this out?

Here is the code: Google Colab
Thank you so much.

Hi @Mit1208

If I understand correctly, the attached image contains the bounding boxes obtained via the processing steps of Kosmos2Processor:

processed_text, entities = processor.post_process_generation(test_decode)

And you want to demonstrate via this image that the bounding boxes do not match the original input bounding boxes (which you specified via set_box and then passed through normalized_box(convert_box)). Is this correct?
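
For reference, entities here is a list of (phrase, (start, end) character span in the cleaned text, list of bboxes) tuples, where each bbox is (x1, y1, x2, y2) normalized to [0, 1]; the values below are just illustrative:

# entities == [("a paragraph", (0, 11), [(0.109375, 0.203125, 0.890625, 0.296875)])]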

I can see your image width/height is set to 1025 (I didn’t verify, however). Kosmos2Processor splits the image into a 32 x 32 grid. With the original input size of 224, each cell is of size 7x7 pixels. However, with your image size of 1025, each cell will have size 32x32 pixels, which is quite large for document AI, I believe.
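
As a rough illustration (this sketch approximates the quantization using cell centers; the exact rounding inside Kosmos2Processor differs slightly, but the magnitude of the error is the same):

import math

def snap_to_grid(coord, num_patches_per_side=32):
    # snap a normalized coordinate to the center of its grid cell
    cell = min(math.floor(coord * num_patches_per_side), num_patches_per_side - 1)
    return (cell + 0.5) / num_patches_per_side

# on a 1025-pixel-wide page, x-coordinates 100 and 120 fall into the same cell
print(snap_to_grid(100 / 1025) * 1025)  # ~112.1
print(snap_to_grid(120 / 1025) * 1025)  # ~112.1 as well: a 20-pixel gap is lost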

In order to verify this, you can compare the original input bboxes against the final processed/computed output bboxes, and see if their differences are in the range of 32 pixels (or even 64). If the differences are all inside this range, I wouldn’t say there is something wrong in the code of Kosmos2Processor; in that case, it’s just a limitation of this kind of processing.

If the differences are larger than 32 or 64, something is likely wrong, and I can take a closer look.
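
Concretely, something along these lines (orig_bboxes is a placeholder for your original pixel boxes, in the same order as the phrases in the text):

processed_text, entities = processor.post_process_generation(test_decode)
decoded_bboxes = [bbox for _, _, bboxes in entities for bbox in bboxes]

width = height = 1025
for orig, dec in zip(orig_bboxes, decoded_bboxes):
    # scale the normalized decoded box back to pixel space
    dec_px = [dec[0] * width, dec[1] * height, dec[2] * width, dec[3] * height]
    diff = max(abs(o - d) for o, d in zip(orig, dec_px))
    print(orig, "->", [round(v, 1) for v in dec_px], "max diff:", round(diff, 1))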

If you don’t mind training a model from scratch, there are some arguments that can be changed to modify the default properties of Kosmos2Processor. I can share more information if you are down for this.


Thank you so much @ydshieh for looking into this issue. Yes, your understanding is correct.

You are right that the image size is 1025. About the bounding boxes: the labeled bounding boxes range from 0-1025, and I normalized them to 0-1 for Kosmos2Processor. I tried resizing the image to 224 but found the same results.

You mentioned the following; how can I do that?

In order to verify this, you can compare the original input bboxes against the final processed/computed output bboxes, and see if their differences are in the range of 32 (or even 64).

I think training it from scratch is difficult because I don’t have GPUs. Do you think I can train the model on free Colab?

It would be helpful if you could:

  • first check what I suggested above:

In order to verify this, you can compare the original input bboxes against the final processed/computed output bboxes, and see if their differences are in the range of 32 (or even 64). …

Regarding:

I tried resizing the image to 224 but found the same results.

Could you add extra cells (at the end of the original notebook) showing this?

However, I think with a 224-pixel document image, the word/line gaps become very small too. The ideal situation is that the word/line gaps are large enough (compared to the word font size), so that slightly-off bounding boxes won’t have an undesired visual effect.

I think training it from scratch is difficult because I don’t have GPUs. Do you think I can train the model on free Colab?

First, this model was not trained on documents, so I doubt it will work well (even with fine-tuning).

Training the model from scratch on free Colab → it probably could run, but it will be quite slow (data processing, training speed, etc.).

I have added the 224x224 resizing code and output; it’s the same as with the normal image. Can you check whether my bbox conversion from (x, y, w, h) to [x, y, x+w, y+h] is correct, and do I need to repeat that step after decoding from Kosmos2Processor?
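
For reference, this is roughly what my conversion looks like (coco_bboxes is a placeholder for the DocLayNet annotations, which are (x, y, w, h) in pixels):

def coco_to_normalized_xyxy(bbox, width, height):
    # (x, y, w, h) in pixels -> (x1, y1, x2, y2) normalized to [0, 1]
    x, y, w, h = bbox
    return (x / width, y / height, (x + w) / width, (y + h) / height)

float_val = [coco_to_normalized_xyxy(b, 1025, 1025) for b in coco_bboxes]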

It seems I am very close but missing something small.

I just wanted to see how this model handles documents compared to LayoutLMv3.

Hi, thanks. I see you posted the desired outputs. Looking at both visually, I think it’s really just the limitation of Kosmos2Processor (which uses cells in a 32x32 grid): it works well on common images of smaller size, but for documents the cell approach isn’t good, as we need higher precision for the locations.

Hi, is there a workaround for this, like creating more patches? I don’t have a lot of experience beyond fine-tuning models, but I can give it a shot. Can you point me to some documentation or a tutorial I can follow to learn about this and overcome the issue?

Thanks.

Yes, you can specify num_patch_index_tokens when creating Kosmos2Processor. (I think it works when using from_pretrained and passing num_patch_index_tokens.)

And when doing the processing, we need to pass num_patches_per_side.

(See clean_text_and_extract_entities_with_bboxes. Note that post_process_generation doesn’t have this argument yet; I should probably update that method.)
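
A rough, untested sketch of what I mean (a 64 x 64 grid needs 64 * 64 = 4096 patch-index tokens; the pretrained checkpoint has no embeddings for the extra tokens, which is why this implies training from scratch; generated_text is a placeholder for your decoded model output):

from transformers import Kosmos2Processor
from transformers.models.kosmos2.processing_kosmos2 import (
    clean_text_and_extract_entities_with_bboxes,
)

# double the location precision of the default 32 x 32 grid
processor = Kosmos2Processor.from_pretrained(
    "microsoft/kosmos-2-patch14-224", num_patch_index_tokens=64 * 64
)

# post-process with the matching grid size
processed_text, entities = clean_text_and_extract_entities_with_bboxes(
    generated_text, num_patches_per_side=64
)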


Let me know how it goes!

Thanks, sure I will keep you posted.

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.