Is Annotation Necessary for a Multimodal LLM to Learn the Details in an Image? Or Is It Just Beneficial?
For example, suppose I provide an image and the accompanying text says: there is a speed sign on the left side, there are children eating ice cream on the right side, a stork is flying above, the weather is sunny, the ground is asphalt and pavement, and so on. Do I need to annotate these elements for the model to learn them?
I have seen some fine-tuning examples that only use image and text without annotation.
This is a paper about VLMs rather than multimodal LLMs, but it suggests that if location matters in the output, annotated data is better. The structure of the model itself should also be designed with this in mind.
Otherwise, generally speaking, highly accurate training can be done with just image-text pairs. In particular, if the resolution of the training images is high enough, this alone seems to be quite effective.
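To make the distinction concrete, here is a minimal illustrative sketch (not from the thread or any specific library) contrasting a plain image-text pair with a caption that embeds location annotations as text tags. The file name, tag format, and coordinates are hypothetical examples:

```python
# 1) Plain image-text pair: the model learns details only from the caption wording.
plain_sample = {
    "image": "street_scene.jpg",
    "text": (
        "A speed sign stands on the left, children eat ice cream on the right, "
        "a stork flies overhead; sunny weather, asphalt road and pavement."
    ),
}

# 2) Caption with explicit location annotations, e.g. normalized bounding boxes
#    embedded as text tags. This style is useful when the model or its output
#    needs to be location-aware. The <box> tag format here is purely illustrative.
annotated_sample = {
    "image": "street_scene.jpg",
    "text": (
        "A speed sign <box>0.05,0.30,0.15,0.55</box> on the left, "
        "children eating ice cream <box>0.70,0.45,0.95,0.80</box> on the right, "
        "a stork <box>0.40,0.05,0.55,0.20</box> flying above."
    ),
}
```

Whether the extra structure in the second form pays off depends on whether the downstream task actually needs grounded locations, which leads to the point below.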
It is a mundane conclusion, but I think which training approach is more efficient depends on the model's use case.