Is annotation required for multi-modal large language models (LLMs) to learn fine-grained image details, or is it merely helpful?
For example, suppose I show the model an image and ask about specific details: a speed sign on the left, children eating ice cream on the right, a stork flying overhead, sunny weather, and asphalt pavement below. Do I need to explicitly annotate each of these elements for the model to learn them?
I have seen many fine-tuning examples that use only image-text pairs, with no annotations at all. So my question is: is annotation a critical mechanism for learning, or can models learn these details effectively from unannotated data?
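To make the contrast concrete, here is a minimal sketch of the two data formats I have in mind. The field names and structure are illustrative assumptions on my part, not tied to any particular framework or dataset:

```python
# A plain image-text pair, as used in many fine-tuning examples:
# the caption describes the scene, but nothing is localized.
unannotated_sample = {
    "image": "street_scene.jpg",
    "caption": "A speed sign on the left, children eating ice cream "
               "on the right, and a stork flying overhead on a sunny day.",
}

# The same image with explicit region annotations (bounding boxes as
# [x_min, y_min, x_max, y_max] pixel coordinates) plus image-level
# attribute labels. All keys and values here are hypothetical.
annotated_sample = {
    "image": "street_scene.jpg",
    "regions": [
        {"label": "speed sign", "bbox": [12, 140, 88, 260]},
        {"label": "children eating ice cream", "bbox": [420, 180, 610, 400]},
        {"label": "stork in flight", "bbox": [300, 10, 380, 70]},
    ],
    "attributes": {"weather": "sunny", "ground": "asphalt"},
}
```

In the first format, the model would have to infer what and where each detail is from the caption alone; in the second, every detail is made explicit. My question is whether the second format is necessary, or only an optimization.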