Multimodal training

Hi,

I have a dataset that consists of images (scientific figures), their captions, and excerpts from the paper's main text that reference each figure. The goal is, given a figure and its caption, to understand the figure, i.e., to produce the kind of explanation the paper's text gives. This is not a standard image-captioning problem but more of a reasoning problem.

I would appreciate any pointers on how to train with image-text pairs as input and text as output. The figure captions matter a great deal here: many figures look alike, even within a single paper, and the caption is what differentiates them.

Thanks for all the suggestions in advance.

In your case, I think you would want to combine a VLM and an LLM to perform VQA-like tasks. You could train lightweight models separately and then combine them, or use one of the high-performance VLMs that already have fairly LLM-like capabilities.

However, I think a model like LLaVA, which combines a VLM and an LLM in a single model, would be more suitable.
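For example, a minimal inference sketch with a LLaVA checkpoint through transformers could look like the following. The image path and caption are placeholders, and the prompt format follows the llava-hf 1.5 checkpoints:

```python
# Minimal sketch: feed a figure plus its caption to LLaVA and ask for an
# explanation. Checkpoint, image path, and caption are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("figure_3.png")  # hypothetical path
caption = "Figure 3: Ablation of the attention window size."  # hypothetical caption
prompt = (
    "USER: <image>\n"
    f"The figure caption is: {caption}\n"
    "Explain what this figure shows and what conclusion it supports.\n"
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```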

Other approaches by Hugging Chat


Based on the sources provided, here are effective approaches and models for training on image-text pairs to understand scientific figures and generate reasoned text outputs:


1. Contrastive Learning with Captioning Models

  • Model: CoCa (Contrastive Captioner) [1]

    • CoCa is a foundation model that leverages both contrastive and captioning losses. It aligns images and text by learning similar representations for related image-text pairs and generates descriptive captions.
    • Key Features:
      • Simultaneous learning of cross-modal alignment and caption generation.
      • Effective for nuanced understanding of visual-text relationships.
    • Use Case: Ideal for your dataset, as it can handle image-text pairs and generate context-aware captions (see the sketch after this list).
  • Model: Mistral 7B [3]

    • A large language model that serves as the text backbone in an image-captioning pipeline (paired with a vision encoder). The linked guide focuses on generating human-like captions by understanding complex scenes.
    • Key Features:
      • Sophisticated scene understanding and natural language description.
      • Can be adapted for scientific figures by training on your dataset.
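If you want to try CoCa concretely, the open_clip library ships pretrained CoCa weights. A minimal captioning sketch, assuming `pip install open_clip_torch` and the pretrained tag from the open_clip README:

```python
# Minimal sketch: generate a caption for a figure with open_clip's CoCa.
# The pretrained tag follows the open_clip README; the image path is a
# placeholder.
import torch
import open_clip
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

image = transform(Image.open("figure_3.png")).unsqueeze(0)  # hypothetical path

with torch.no_grad():
    generated = model.generate(image)  # autoregressive captioning head

caption = open_clip.decode(generated[0])
print(caption.split("<end_of_text>")[0].replace("<start_of_text>", ""))
```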

2. Explicit Image Caption Reasoning (ECR)

  • Model: ECRMM (Explicit Caption Reasoning Multimodal Model) [4]
    • ECR employs inference chaining to analyze images deeply and generate detailed captions. It is particularly effective for complex scenes and fine-grained information.
    • Key Features:
      • Focuses on reasoning and semantic parsing for accurate and detailed descriptions.
      • Fine-tuned on datasets like ICICD, which includes images, captions, and textual context.
    • Use Case: Suitable for your dataset, as it emphasizes understanding the relationships between images, captions, and textual context.
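As far as I know, ECRMM itself is not a released checkpoint, but the underlying idea (make the model enumerate visual elements before reasoning about them) can be approximated with two-step prompt chaining on any chat-style VLM. A rough sketch, reusing the LLaVA `model` and `processor` from the earlier snippet; the prompts are illustrative, not from the paper:

```python
# Rough sketch of ECR-style explicit reasoning via two-step prompt
# chaining. `model` and `processor` are the LLaVA objects from the
# earlier snippet.
def explain_figure(model, processor, image, caption):
    # Step 1: enumerate the visual elements before asking for conclusions.
    step1 = (
        "USER: <image>\n"
        f"The figure caption is: {caption}\n"
        "List the axes, curves, and labels visible in this figure.\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=step1, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    elements = processor.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

    # Step 2: feed the enumerated elements back in and ask for the reasoning.
    step2 = (
        "USER: <image>\n"
        f"The figure caption is: {caption}\n"
        f"Observed elements: {elements}\n"
        "Given these elements, what result does the figure demonstrate?\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=step2, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```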

3. Contrastive Learning and Multi-Modal Training

  • Approach: Contrastive learning [2][4]

    • Train a model to align images and text by encouraging similar representations for related pairs. This is particularly useful when figure captions are critical for differentiation.
    • Implementation:
      • Use pre-trained models like CoCa or Mistral 7B and fine-tune them on your dataset.
      • Incorporate the figure captions as part of the training input to guide the model toward accurate and context-aware reasoning (a loss sketch follows this list).
  • Model: Multi-Modal Transformers [2]

    • Multi-modal transformers trained with masked pre-training objectives (e.g., MAST) can process images and text together, improving cross-modal understanding.
    • Key Features:
      • Handles image-text pairs as input and generates text output aligned with the visual context.
      • Effective for reasoning tasks where captions are central to understanding.
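Concretely, the contrastive objective is usually a symmetric InfoNCE loss over a batch of matched pairs. A minimal sketch, assuming you already have an image encoder and a text encoder that produce same-dimensional embeddings:

```python
# Minimal sketch of a CLIP-style symmetric InfoNCE loss. Assumes the
# i-th image embedding and i-th text embedding in the batch are a
# matched pair.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)  # matches on the diagonal
    # Symmetric loss: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# For this dataset, the text side would be something like
# f"{caption} {paper_excerpt}" rather than the caption alone.
```

Concatenating the caption into the text side is what lets the loss separate visually similar figures: the in-batch negatives then differ in caption text even when the images look alike.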

Recommendations

  • Start with CoCa for its strong performance in image-text alignment and caption generation.
  • Fine-tune Mistral 7B or ECRMM on your dataset to leverage advanced scene understanding and reasoning capabilities.
  • Use contrastive learning to align images with their captions, especially when figures are visually similar.
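Whichever model you pick, the supervision signal stays the same: (figure + caption) in, paper excerpt out. A minimal sketch of shaping the dataset that way; the field names here are hypothetical, so adapt them to your schema:

```python
# Minimal sketch: turn one dataset record into a supervised fine-tuning
# example. Field names ("figure_path", "caption", "paper_excerpt") are
# hypothetical placeholders for your actual schema.
def to_training_example(record):
    return {
        "image": record["figure_path"],
        "prompt": (
            f"The figure caption is: {record['caption']}\n"
            "Explain what this figure shows in the context of the paper."
        ),
        "target": record["paper_excerpt"],
    }
```

Each example can then be tokenized with the chosen model's processor and fed to a standard training loop, with the loss computed only on the target tokens.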

References

  • [1] Learn CoCa: Image-Text Foundation Models with Contrastive Captioners
  • [2] Multimodal training - 🤗Transformers - Hugging Face Forums
  • [3] Image Captioning with Mistral 7B LLM: A Hands-on Guide
  • [4] Explicit Image Caption Reasoning (ECR)

Training Tips

Oh wow, thank you @John6666 for the detailed answers. I will check out the models and references.
