I have a dataset that consists of images (scientific figures), their captions, and excerpts from the paper's main text that reference each figure. The goal is: given a figure and its caption, can a model understand the figure, i.e., recover what the paper's text says about it? This is not quite an image-captioning problem; it is more of a reasoning problem.
I would appreciate any pointers on how to train with image-text pairs as input and text as output. The figure captions are quite important here: many figures look alike, even within a single paper, and the caption is what differentiates them.
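Roughly, each example in my dataset looks like the following (the field names and the example text below are made up purely to show the structure):

```python
# One training example: the figure image and its caption are the input,
# and the excerpt(s) from the paper body that reference the figure are the target.
example = {
    "image_path": "figures/paper_042_fig3.png",
    "caption": "Figure 3: Validation accuracy vs. encoder depth for the three ablations.",
    "target_text": (
        "As shown in Figure 3, accuracy saturates beyond six encoder layers, "
        "suggesting that the deeper variants mainly add computational cost."
    ),
}
```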
In your case, I think you would want to combine a VLM and an LLM to perform a VQA-like task. You could train lightweight models separately and then combine them, though some high-performance VLMs already come with fairly strong LLM-like capabilities built in.
However, I think a model like LLaVA, which combines a VLM with an LLM, would be more suitable.
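For instance, here is a rough zero-shot inference sketch with the llava-hf/llava-1.5-7b-hf checkpoint on Hugging Face (the checkpoint, prompt template, and file names are illustrative; for your task you would most likely fine-tune on your figure/caption/excerpt triples rather than rely on zero-shot behaviour):

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

figure = Image.open("figures/paper_042_fig3.png")
caption = "Figure 3: Validation accuracy vs. encoder depth."

# Passing the caption in the prompt lets the model disambiguate
# visually similar figures from the same paper.
prompt = (
    "USER: <image>\n"
    f"This figure has the caption: {caption}\n"
    "Explain what the figure shows in the context of the paper. ASSISTANT:"
)

inputs = processor(images=figure, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

For fine-tuning, the same prompt format can be reused with the paper excerpt as the target completion.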
Based on the sources provided, here are effective approaches and models for training on image-text pairs to understand scientific figures and generate reasoned text outputs:
1. Contrastive Learning with Captioning Models
Model: CoCa (Contrastive Captioner) [1]
CoCa is a foundation model trained with both a contrastive loss and a captioning loss: it aligns images and text by learning similar representations for matching image-text pairs while also generating descriptive captions (a schematic of the combined objective is sketched after this list).
Key Features:
Simultaneous learning of cross-modal alignment and caption generation.
Effective for nuanced understanding of visual-text relationships.
Use Case: Ideal for your dataset, as it can handle image-text pairs and generate context-aware captions.
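As a rough illustration, the combined objective can be written in plain PyTorch like this (a sketch of the idea, not the official CoCa implementation; the 2.0 weight on the captioning term follows the paper's default, but treat it as a hyperparameter):

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    temperature=0.07, caption_weight=2.0):
    """Sketch of a CoCa-style objective: contrastive alignment + captioning."""
    # Contrastive term: matching image/caption pairs lie on the diagonal.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels)
                   + F.cross_entropy(logits.t(), labels)) / 2

    # Captioning term: autoregressive next-token prediction on the caption.
    # caption_logits: (batch, seq_len, vocab); caption_targets: (batch, seq_len).
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten(),
                                 ignore_index=-100)

    return contrastive + caption_weight * captioning
```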
Model: Mistral 7B [3]
A text-only large language model that is commonly used in image-captioning pipelines: paired with a vision encoder or a lightweight captioner, it turns raw visual descriptions into fluent, human-like text about complex scenes.
Key Features:
Sophisticated scene understanding and natural language description.
Can be adapted to scientific figures by fine-tuning on your dataset (see the pipeline sketch after this list).
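Because Mistral 7B itself only consumes text, one common pattern is to pair it with a small vision captioner: the captioner drafts a literal description of the figure, and the LLM reasons over that draft together with the paper's caption. A minimal sketch, assuming the BLIP and Mistral-Instruct checkpoints on Hugging Face (checkpoints, prompts, and file names are illustrative):

```python
import torch
from PIL import Image
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BlipForConditionalGeneration, BlipProcessor)

# Step 1: draft a literal visual description of the figure with a small captioner.
blip_id = "Salesforce/blip-image-captioning-base"
blip_processor = BlipProcessor.from_pretrained(blip_id)
blip = BlipForConditionalGeneration.from_pretrained(blip_id)

figure = Image.open("figures/paper_042_fig3.png")
blip_inputs = blip_processor(images=figure, return_tensors="pt")
draft_ids = blip.generate(**blip_inputs, max_new_tokens=60)
draft = blip_processor.decode(draft_ids[0], skip_special_tokens=True)

# Step 2: let Mistral reason over the draft plus the real figure caption.
llm_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(llm_id, torch_dtype=torch.float16,
                                           device_map="auto")

caption = "Figure 3: Validation accuracy vs. encoder depth."
messages = [{"role": "user", "content": (
    f"A scientific figure is visually described as: {draft}\n"
    f"Its caption reads: {caption}\n"
    "Explain what this figure most likely shows and why it matters for the paper."
)}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt").to(llm.device)
output_ids = llm.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```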
2. Reasoning-Based Captioning with Inference Chaining
Model: ECR
ECR employs inference chaining to analyze images in depth and generate detailed captions. It is particularly effective for complex scenes and fine-grained information (a generic sketch of the chaining pattern follows this list).
Key Features:
Focuses on reasoning and semantic parsing for accurate and detailed descriptions.
Fine-tuned on datasets like ICICD, which includes images, captions, and textual context.
Use Case: Suitable for your dataset, as it emphasizes understanding the relationships between images, captions, and textual context.
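The chaining pattern itself is easy to emulate with whatever backend you end up using. The sketch below is a generic illustration of inference chaining, not the ECR implementation, and the `generate` callable is a placeholder for any text-generation model or API:

```python
from typing import Callable

def chained_figure_reasoning(caption: str, visual_draft: str,
                             generate: Callable[[str], str]) -> str:
    """Generic inference chaining: each step conditions on the previous output."""
    # Step 1: ground the raw visual description in the caption.
    grounded = generate(
        f"Figure caption: {caption}\nVisual description: {visual_draft}\n"
        "List the quantities, axes, and conditions this figure appears to compare."
    )
    # Step 2: reason about what the comparison implies.
    implication = generate(
        f"Given this breakdown of a figure:\n{grounded}\n"
        "What conclusion is the figure most likely supporting?"
    )
    # Step 3: produce the final explanation in the style of the paper's body text.
    return generate(
        f"Caption: {caption}\nBreakdown: {grounded}\nLikely conclusion: {implication}\n"
        "Write a short paragraph explaining this figure as the paper's text would."
    )
```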
3. Contrastive Learning and Multi-Modal Training
Approach: Contrastive learning [2][4]
Train a model to align images and text by encouraging similar representations for related pairs. This is particularly useful when figure captions are critical for differentiation.
Implementation:
Use a pre-trained image-text model such as CoCa or a CLIP-style dual encoder and fine-tune it on your dataset (see the fine-tuning sketch below).
Incorporate the figure captions as part of the training input so the model learns to separate visually similar figures by their captions and reason in context.
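A minimal fine-tuning sketch with a CLIP-style dual encoder from Hugging Face (the checkpoint and hyperparameters are illustrative, and the batching helper is assumed to yield lists of PIL images with their caption strings):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def finetune_clip_on_figures(figure_caption_batches, lr=1e-5):
    """figure_caption_batches: iterable of (list of PIL images, list of caption strings)."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for images, captions in figure_caption_batches:
        inputs = processor(text=captions, images=images,
                           return_tensors="pt", padding=True, truncation=True)
        # return_loss=True yields CLIP's symmetric contrastive loss:
        # each figure is pulled toward its own caption and pushed away
        # from the other captions in the batch, which is exactly what
        # helps separate look-alike figures.
        outputs = model(**inputs, return_loss=True)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```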
Model: Multi-Modal Transformers [2]
Models pre-trained with masked multimodal objectives (e.g., MAST) can process images and text together, improving cross-modal understanding.
Key Features:
Handles image-text pairs as input and generates text output aligned with the visual context.
Effective for reasoning tasks where captions are central to understanding.
Recommendations
Start with CoCa for its strong performance in image-text alignment and caption generation.
Fine-tune Mistral 7B (paired with a vision encoder) or ECRMM on your dataset to leverage advanced scene understanding and reasoning capabilities.
Use contrastive learning to align images with their captions, especially when figures are visually similar.
References
[1] Learn CoCa: Image-Text Foundation Models with Contrastive Captioners [Source]
[2] Multimodal training - Transformers - Hugging Face Forums [Source]
[3] Image Captioning with Mistral 7B LLM: A Hands-on Guide [Source]