I have been using LLMs full time for 20 months, but I am new to vision-language models (VLMs).
I have about 10,000 high-resolution images (48 MP) of structures in which I want to detect defects. The textures in the images vary and can be grouped into roughly 12 categories. I figure a two-stage approach will help. I suspect I will have to break each 48 MP image into smaller tiles so the model can cope with the number of tokens, which raises the issue of overlapping the tiles to catch defects that span tile boundaries.

Stage 1: use a PEFT fine-tuned VLM to classify each image into one of the 12 texture categories.
Stage 2: route the image to one of several PEFT fine-tuned models, each specialised for one or a few texture categories, to detect the defects.

This will be run on batches of images overnight, so tokens/sec is not too important.
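To make the overlap question concrete, here is a minimal sketch of how the overlapped tile grid could be generated. The tile size (1024 px), overlap (128 px), and example image dimensions (8000x6000) are illustrative assumptions on my part, not fixed requirements:

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes covering an image.

    Adjacent tiles share `overlap` pixels, so a defect that straddles a
    tile boundary (up to the overlap size) appears whole in at least one tile.
    """
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Ensure the right/bottom edges are covered even when the stride
    # does not divide the image size evenly.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]

# A 48 MP frame at 8000x6000 with 1024 px tiles and 128 px overlap:
boxes = tile_boxes(8000, 6000, tile=1024, overlap=128)
```

A larger overlap reduces the chance of splitting a defect but increases the tile count (and overnight runtime), so the right value depends on the typical defect size in pixels.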
I am looking for suggestions on model selection, and also for critique of my approach above. References to relevant research papers are most welcome.
Thank you!