Best model for image object comparison?

Hmm, I agree that separating the processing would be more reliable. It depends on which object you designate as the main object. If you want to use a single model, I think CLIP would be the closest fit, but it would be difficult to obtain a score with that approach.
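
For what it's worth, here is a rough sketch of the single-model route, in case it helps to see why the score is hard to pin down. It assumes the Hugging Face `transformers` CLIP classes and the `openai/clip-vit-base-patch32` checkpoint (both my choice, not something from this thread), and the helper names `cosine_similarity` and `clip_image_similarity` are made up for illustration. It embeds the two whole images and takes the cosine similarity, so the number reflects the entire scene, not just the main object.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def clip_image_similarity(
    path_a: str,
    path_b: str,
    model_name: str = "openai/clip-vit-base-patch32",  # assumed checkpoint
) -> float:
    """Whole-image similarity from CLIP image embeddings (hypothetical helper)."""
    # Heavy dependencies are imported here so the pure-Python helper above
    # can be used without them installed.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    images = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=images, return_tensors="pt")

    with torch.no_grad():
        out = model.get_image_features(**inputs)

    # Assumption: this is a (2, dim) tensor; the getattr fallback covers
    # library versions that wrap the features in an output object.
    feats = getattr(out, "pooler_output", out)
    return cosine_similarity(feats[0].tolist(), feats[1].tolist())
```

You would call it as `clip_image_similarity("a.jpg", "b.jpg")` with your own image paths. The result is not calibrated and is influenced by the background and any other objects in the frame, which is why cropping to the main object first (the separated processing mentioned above) is likely to be more reliable.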