Hi there,
I would like to use Integrated Gradients from the Captum library on both the image and the text inputs to explain the predictions of VLMs such as LLaVA-OneVision or Phi-3.5-Vision-instruct on a VQA task with multiple images.
I have already created a Google Colab notebook for a simple model from Transformers. However, I cannot pass pixel_values and inputs_embeds together when using the LLaVA model. Do you have any ideas on how to overcome this?
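To make the problem concrete, here is a minimal sketch of what I mean (the checkpoint, prompt, and image below are placeholders, not the exact ones from my notebook): I take the text token embeddings from the model's embedding layer, attribute over them with IntegratedGradients, and pass pixel_values through additional_forward_args. The model call inside forward_func is the one that fails for me.

```python
import torch
from PIL import Image
from captum.attr import IntegratedGradients
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)
model.eval()

image = Image.new("RGB", (336, 336))  # placeholder image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"  # placeholder prompt
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Text token embeddings that I would like to attribute over
inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])

def forward_func(inputs_embeds, pixel_values, attention_mask):
    # This is the call that fails for me: inputs_embeds and pixel_values together
    outputs = model(
        inputs_embeds=inputs_embeds,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
    )
    # Use the score of the most likely next token as the scalar to explain
    return outputs.logits[:, -1, :].max(dim=-1).values

ig = IntegratedGradients(forward_func)
attributions = ig.attribute(
    inputs_embeds,
    baselines=torch.zeros_like(inputs_embeds),
    additional_forward_args=(inputs["pixel_values"], inputs["attention_mask"]),
    n_steps=16,
)
```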
Also, the attributions I get for the dandelin/vilt-b32-finetuned-vqa model don't look quite right to me.
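For reference, this is the kind of setup I mean for ViLT, again only as a minimal sketch (the question and image are placeholders, and my notebook may differ in details): IntegratedGradients directly over pixel_values with a zero baseline, the text inputs passed via additional_forward_args, and the predicted answer class as the target.

```python
import torch
from PIL import Image
from captum.attr import IntegratedGradients
from transformers import ViltForQuestionAnswering, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model.eval()

image = Image.new("RGB", (384, 384))       # placeholder image
question = "What color is the car?"        # placeholder question
inputs = processor(image, question, return_tensors="pt")

def forward_func(pixel_values, input_ids, attention_mask, token_type_ids, pixel_mask):
    # Answer logits, one row per example
    return model(
        input_ids=input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        pixel_mask=pixel_mask,
    ).logits

# Explain the answer class the model actually predicts
with torch.no_grad():
    target = int(model(**inputs).logits.argmax(-1))

ig = IntegratedGradients(forward_func)
attributions = ig.attribute(
    inputs["pixel_values"],
    baselines=torch.zeros_like(inputs["pixel_values"]),
    target=target,
    additional_forward_args=(
        inputs["input_ids"],
        inputs["attention_mask"],
        inputs["token_type_ids"],
        inputs["pixel_mask"],
    ),
    n_steps=16,
)
```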
For LLaVA-OneVision I have created this notebook.
Thank you for your help
Nico