Is Annotation Necessary for a Multimodal LLM to Learn the Details in an Image? Or Is It Just Beneficial?
For example, suppose I provide an image and the accompanying text says: there is a speed sign on the left side, there are children eating ice cream on the right side, a stork is flying above, the weather is sunny, the ground is asphalt and pavement, and so on. Do I need to annotate these elements for the model to learn them?
I have seen some fine-tuning examples that only use image and text without annotation.
This is a paper about VLMs rather than multimodal LLMs, but it suggests that if location matters in the output, annotated data is better. The structure of the model itself should also be designed with this in mind.
Otherwise, generally speaking, highly accurate training can be done with just image-text pairs. In particular, if the resolution of the training images is high enough, this alone seems to be quite effective.
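To make the distinction concrete, here is a minimal illustrative sketch (not from the thread or any specific library) contrasting a plain image-text pair with a caption that embeds location annotations as text tags. The file name, tag format, and coordinates are hypothetical examples:

```python
# 1) Plain image-text pair: the model learns details only from the caption wording.
plain_sample = {
    "image": "street_scene.jpg",
    "text": (
        "A speed sign stands on the left, children eat ice cream on the right, "
        "a stork flies overhead; sunny weather, asphalt road and pavement."
    ),
}

# 2) Caption with explicit location annotations, e.g. normalized bounding boxes
#    embedded as text tags. This style is useful when the model or its output
#    needs to be location-aware. The <box> tag format here is purely illustrative.
annotated_sample = {
    "image": "street_scene.jpg",
    "text": (
        "A speed sign <box>0.05,0.30,0.15,0.55</box> on the left, "
        "children eating ice cream <box>0.70,0.45,0.95,0.80</box> on the right, "
        "a stork <box>0.40,0.05,0.55,0.20</box> flying above."
    ),
}
```

Whether the extra structure in the second form pays off depends on whether the downstream task actually needs grounded locations, which leads to the point below.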
It is a mundane conclusion, but I think which training approach is more efficient depends on the model's use case.