Fine-tuning a multimodal model

hello! Does anyone have an example of how to fine-tune a multimodal model? The examples I’ve found only take text as input, i.e. plain LLMs. I’m dealing with Vision-Language Models.
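
For context, this is the rough shape of what I’ve pieced together so far by adapting the text-only examples. It’s a minimal, untested sketch: the checkpoint name (llava-hf/llava-1.5-7b-hf), the prompt template, and the toy image/answer pair are just placeholders, and I haven’t verified it runs end to end.

```python
# Rough sketch (untested) of supervised fine-tuning for a Vision-Language Model
# with Hugging Face Transformers. Model id, prompt format, and data are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; swap in your VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.train()

# Toy (image, prompt, answer) example; in practice this comes from a DataLoader.
image = Image.new("RGB", (336, 336), color="white")  # placeholder image
prompt = "USER: <image>\nDescribe the image. ASSISTANT: A blank white square."

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(model.device, torch.bfloat16)  # casts only the float tensors (pixel_values)

# Causal-LM style SFT: labels are the input ids themselves
# (optionally mask the prompt tokens with -100 so only the answer is supervised).
labels = inputs["input_ids"].clone()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

outputs = model(**inputs, labels=labels)  # forward pass returns the LM loss
outputs.loss.backward()                   # backprop through vision tower + language model
optimizer.step()
optimizer.zero_grad()
print("loss:", outputs.loss.item())
```

Is this roughly the right direction, or is there a recommended recipe (e.g. LoRA via peft, or a trainer that handles the image inputs) for VLMs specifically?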