Multimodal training

Hi,

I have a dataset that consists of images (scientific figures), their captions, and excerpts from the paper's main text that reference each figure. The goal is, given a figure and its caption, to understand the figure, i.e., to produce the kind of explanation the paper's text gives. This is not a standard image-captioning problem but more of a reasoning problem.

I would appreciate any pointers on how to train with image-text pairs as input and text as output. The figure captions matter a great deal here: many figures look alike, even within a single paper, and the caption is what differentiates them.

Thanks for all the suggestions in advance.

In your case, I think you would want to combine a VLM and an LLM to perform VQA-like tasks. You could train lightweight models separately and then combine them, or use one of the high-performance VLMs that already have fairly LLM-like capabilities.

However, I think a model like LLaVA, which combines a VLM and an LLM in a single model, would be more suitable.
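For example, a minimal inference sketch with a LLaVA checkpoint through transformers could look like the following. The image path and caption are placeholders, and the prompt format follows the llava-hf 1.5 checkpoints:

```python
# Minimal sketch: feed a figure plus its caption to LLaVA and ask for an
# explanation. Checkpoint, image path, and caption are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("figure_3.png")  # hypothetical path
caption = "Figure 3: Ablation of the attention window size."  # hypothetical caption
prompt = (
    "USER: <image>\n"
    f"The figure caption is: {caption}\n"
    "Explain what this figure shows and what conclusion it supports.\n"
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```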

Other approaches by Hugging Chat


Based on the sources provided, here are effective approaches and models for training on image-text pairs to understand scientific figures and generate reasoned text outputs:


1. Contrastive Learning with Captioning Models

  • Model: CoCa (Contrastive Captioner) [1]

    • CoCa is a foundation model that leverages both contrastive and captioning losses. It aligns images and text by learning similar representations for related image-text pairs and generates descriptive captions.
    • Key Features:
      • Simultaneous learning of cross-modal alignment and caption generation.
      • Effective for nuanced understanding of visual-text relationships.
    • Use Case: Ideal for your dataset, as it can handle image-text pairs and generate context-aware captions (see the sketch after this list).
  • Model: Mistral 7B [3]

    • A large language model that serves as the text backbone in an image-captioning pipeline (paired with a vision encoder). The linked guide focuses on generating human-like captions by understanding complex scenes.
    • Key Features:
      • Sophisticated scene understanding and natural language description.
      • Can be adapted for scientific figures by training on your dataset.
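If you want to try CoCa concretely, the open_clip library ships pretrained CoCa weights. A minimal captioning sketch, assuming `pip install open_clip_torch` and the pretrained tag from the open_clip README:

```python
# Minimal sketch: generate a caption for a figure with open_clip's CoCa.
# The pretrained tag follows the open_clip README; the image path is a
# placeholder.
import torch
import open_clip
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

image = transform(Image.open("figure_3.png")).unsqueeze(0)  # hypothetical path

with torch.no_grad():
    generated = model.generate(image)  # autoregressive captioning head

caption = open_clip.decode(generated[0])
print(caption.split("<end_of_text>")[0].replace("<start_of_text>", ""))
```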

2. Explicit Image Caption Reasoning (ECR)

  • Model: ECRMM (Explicit Caption Reasoning Multimodal Model) [4]
    • ECR employs inference chaining to analyze images deeply and generate detailed captions. It is particularly effective for complex scenes and fine-grained information.
    • Key Features:
      • Focuses on reasoning and semantic parsing for accurate and detailed descriptions.
      • Fine-tuned on datasets like ICICD, which includes images, captions, and textual context.
    • Use Case: Suitable for your dataset, as it emphasizes understanding the relationships between images, captions, and textual context.
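As far as I know, ECRMM itself is not a released checkpoint, but the underlying idea (make the model enumerate visual elements before reasoning about them) can be approximated with two-step prompt chaining on any chat-style VLM. A rough sketch, reusing the LLaVA `model` and `processor` from the earlier snippet; the prompts are illustrative, not from the paper:

```python
# Rough sketch of ECR-style explicit reasoning via two-step prompt
# chaining. `model` and `processor` are the LLaVA objects from the
# earlier snippet.
def explain_figure(model, processor, image, caption):
    # Step 1: enumerate the visual elements before asking for conclusions.
    step1 = (
        "USER: <image>\n"
        f"The figure caption is: {caption}\n"
        "List the axes, curves, and labels visible in this figure.\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=step1, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    elements = processor.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

    # Step 2: feed the enumerated elements back in and ask for the reasoning.
    step2 = (
        "USER: <image>\n"
        f"The figure caption is: {caption}\n"
        f"Observed elements: {elements}\n"
        "Given these elements, what result does the figure demonstrate?\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=step2, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```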

3. Contrastive Learning and Multi-Modal Training

  • Approach: Contrastive learning [2][4]

    • Train a model to align images and text by encouraging similar representations for related pairs. This is particularly useful when figure captions are critical for differentiation.
    • Implementation:
      • Use pre-trained models like CoCa or Mistral 7B and fine-tune them on your dataset.
      • Incorporate the figure captions as part of the training input to guide the model toward accurate and context-aware reasoning (a loss sketch follows this list).
  • Model: Multi-Modal Transformers [2]

    • Multi-modal transformers trained with masked pre-training objectives (e.g., MAST) can process images and text together, improving cross-modal understanding.
    • Key Features:
      • Handles image-text pairs as input and generates text output aligned with the visual context.
      • Effective for reasoning tasks where captions are central to understanding.
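Concretely, the contrastive objective is usually a symmetric InfoNCE loss over a batch of matched pairs. A minimal sketch, assuming you already have an image encoder and a text encoder that produce same-dimensional embeddings:

```python
# Minimal sketch of a CLIP-style symmetric InfoNCE loss. Assumes the
# i-th image embedding and i-th text embedding in the batch are a
# matched pair.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)  # matches on the diagonal
    # Symmetric loss: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# For this dataset, the text side would be something like
# f"{caption} {paper_excerpt}" rather than the caption alone.
```

Concatenating the caption into the text side is what lets the loss separate visually similar figures: the in-batch negatives then differ in caption text even when the images look alike.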

Recommendations

  • Start with CoCa for its strong performance in image-text alignment and caption generation.
  • Fine-tune Mistral 7B or ECRMM on your dataset to leverage advanced scene understanding and reasoning capabilities.
  • Use contrastive learning to align images with their captions, especially when figures are visually similar.
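Whichever model you pick, the supervision signal stays the same: (figure + caption) in, paper excerpt out. A minimal sketch of shaping the dataset that way; the field names here are hypothetical, so adapt them to your schema:

```python
# Minimal sketch: turn one dataset record into a supervised fine-tuning
# example. Field names ("figure_path", "caption", "paper_excerpt") are
# hypothetical placeholders for your actual schema.
def to_training_example(record):
    return {
        "image": record["figure_path"],
        "prompt": (
            f"The figure caption is: {record['caption']}\n"
            "Explain what this figure shows in the context of the paper."
        ),
        "target": record["paper_excerpt"],
    }
```

Each example can then be tokenized with the chosen model's processor and fed to a standard training loop, with the loss computed only on the target tokens.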

References

  • [1] Learn CoCa: Image-Text Foundation Models with Contrastive Captioners
  • [2] Multimodal training - 🤗Transformers - Hugging Face Forums
  • [3] Image Captioning with Mistral 7B LLM: A Hands-on Guide
  • [4] Explicit Image Caption Reasoning (ECR)

Training Tips

Oh wow, thank you @John6666 for the detailed answers. I will check out the models and references.
