Idefics3 preprocessing

Hello,
I have some questions about the preprocessing step in the Idefics3 model.
First, I noticed that images are always resized so that longest_edge is 4 x 364 (1456). Is that correct?
How is the aspect ratio preserved if the image is always resized to [1456, 1456]?
Why is a dimension of size 1 added at the front of the processed patches, giving [1, batch * num_patches, channel, 364, 364]?
Why is pixel_attention_mask needed if it is always all ones? I tried two images of different sizes, image_1 at 2000 x 2000 and image_2 at 250 x 250. The processor output had shape [1, 34, 3, 364, 364], and pixel_attention_mask had the same shape with every entry True. So what is its purpose, and how is padding distinguished?
I have looked at the Idefics3 implementation in this file: idefics3_processing
Please help me understand these points, and correct me if I have misunderstood anything.

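As far as I can tell, Idefics3 preprocessing resizes each image so that its longest edge is 1456 (4 x 364) while keeping the aspect ratio, so only a square input ends up as [1456, 1456]. The resized image is then split into 364 x 364 patches for batching. The pixel attention mask marks padding, so it is all True when nothing had to be padded, which is consistent with the [1, 34, 3, 364, 364] output in the question.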
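
To make the shapes in the question concrete, here is a minimal sketch of the arithmetic in plain Python. It is not the transformers code. The function names are made up for illustration, and it assumes three things: the longest edge is scaled to 4 x 364 with the aspect ratio kept, the result is tiled into 364 x 364 patches, and one downscaled global view is appended per image. The mask helper works at patch level only, as a simplification of the per-pixel pixel_attention_mask.

```python
import math

PATCH = 364               # side of one patch, taken from the post
LONGEST_EDGE = 4 * PATCH  # 1456, the resize target for the longest edge


def resize_keep_aspect(height, width, longest_edge=LONGEST_EDGE):
    """Scale so the longest side equals `longest_edge`, keeping the aspect ratio."""
    scale = longest_edge / max(height, width)
    return max(1, round(height * scale)), max(1, round(width * scale))


def patch_grid(height, width, patch=PATCH):
    """Rows and columns of patch-sized tiles needed to cover the resized image."""
    return math.ceil(height / patch), math.ceil(width / patch)


def num_patches(height, width):
    """Tiles plus one global view of the whole image (assumption, see above)."""
    rows, cols = patch_grid(*resize_keep_aspect(height, width))
    return rows * cols + 1


def patch_mask(counts):
    """Hypothetical patch-level mask: True for real patches, False for padding.

    `counts` holds the number of patches per sample; every sample is padded
    to the largest count in the batch.
    """
    longest = max(counts)
    return [[i < n for i in range(longest)] for n in counts]


if __name__ == "__main__":
    # Both images from the question land in a single sample.
    total = num_patches(2000, 2000) + num_patches(250, 250)
    print(total)
    print(all(patch_mask([total])[0]))
    # Two samples with different patch counts: the shorter one gets padding.
    print(patch_mask([num_patches(2000, 2000), num_patches(500, 2000)])[1])
```

Under these assumptions the 2000 x 2000 and 250 x 250 images each produce 17 patches, which matches the 34 in the reported shape. A single sample never needs padding, so its mask is all True. False entries only show up when samples in the same batch have different patch counts.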