I noticed that PaliGemma at 224x224 resolution uses `image_seq_length = 256`, and the PaliGemma paper quotes the same number. I can't make sense of this: with a patch size of 16x16, a 224x224 image gives a 14x14 grid, i.e. 196 tokens, not 256.
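For reference, here's the arithmetic behind my expectation (with a 14-pixel patch shown alongside purely for comparison, since 16x16 = 256):

```python
# Patch-token counts for a 224x224 input under two candidate patch sizes.
for patch in (16, 14):
    grid = 224 // patch  # patches per side
    print(f"patch {patch}: {grid}x{grid} grid = {grid * grid} tokens")
# patch 16: 14x14 grid = 196 tokens
# patch 14: 16x16 grid = 256 tokens
```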
Neither I nor Gemini, ChatGPT, or Claude could find an explanation in the SigLIP or PaliGemma papers, and I'm finding the code tough to navigate. Can anyone explain this to me or point me to the file with the implementation?
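In case it helps, this is roughly how I've been inspecting the shipped config (a minimal sketch; attribute names assumed from transformers' `SiglipVisionConfig`, and the checkpoint is gated on the Hub, so it needs an authenticated login):

```python
from transformers import AutoConfig

# Read the actual vision config of the 224px PaliGemma checkpoint
# instead of relying on the paper's quoted numbers.
cfg = AutoConfig.from_pretrained("google/paligemma-3b-pt-224")
vision = cfg.vision_config  # SiglipVisionConfig
grid = vision.image_size // vision.patch_size
print(f"image_size={vision.image_size}, patch_size={vision.patch_size}, "
      f"tokens={grid * grid}")
```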