Fine-tuning a multimodal model

Hello! Does anyone have an example of how to fine-tune a multimodal model? The examples that I found only take text as input, i.e. LLMs. I’m dealing with vision-language models.

Did you find one?


Yes, you can find them all in my GitHub repository, NielsRogge/Transformers-Tutorials. It contains demos I made with the Transformers library by HuggingFace.

Demo notebooks are grouped per model, so I’d recommend taking a look at the BLIP-2, Idefics2, PaliGemma and LLaVa folders for some examples.
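
For anyone landing here before opening the notebooks: one step that usually differs from text-only fine-tuning is building the labels, since padding and image placeholder tokens should not contribute to the loss. Below is a minimal sketch of that idea in plain Python. The helper name `build_labels` and the token ids are made up for illustration and are not taken from the notebooks; in a real setup the ids would come from the model's processor/tokenizer, and each model handles image tokens a bit differently.

```python
# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def build_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids into labels, masking padding and image placeholder tokens.

    Hypothetical helper for illustration: input_ids is a list of token-id lists,
    one per example in the batch.
    """
    masked = {pad_token_id, image_token_id}
    return [
        [IGNORE_INDEX if tok in masked else tok for tok in seq]
        for seq in input_ids
    ]


# Toy batch with made-up token ids: 0 = padding, 9 = image placeholder.
batch_input_ids = [
    [9, 9, 5, 6, 7, 0, 0],
    [9, 9, 3, 4, 8, 2, 1],
]
labels = build_labels(batch_input_ids, pad_token_id=0, image_token_id=9)
print(labels)
```

In practice the same masking is typically done on tensors inside a collate function, with the pixel values from the processor passed to the model alongside `input_ids` and `labels`.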