hello! Does anyone have an example of how to fine-tune a multimodal model? The examples I found only take text as input, i.e. plain LLMs. I’m dealing with Vision-Language Models
Did you find one?
Hi,
Yes, you can find them all in my repository: GitHub - NielsRogge/Transformers-Tutorials: This repository contains demos I made with the Transformers library by HuggingFace.
Demo notebooks are grouped per model, so I’d recommend taking a look at the BLIP-2, Idefics2, PaliGemma and LLaVa folders for some examples.
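To give a rough idea of the pattern those notebooks follow (this sketch is not taken from them): load a processor and a VLM, build combined image+text inputs, and train against a causal language-modelling loss. The checkpoint name, prompt template, and toy example below are assumptions; a real script would add batching, label masking for the prompt tokens, and typically LoRA/quantization to fit in memory.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint; other LLaVa-style VLMs on the Hub follow the same pattern.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)
model.train()

# Toy single example (image path, question and answer are placeholders).
image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT: A cat sleeping on a sofa."

# The processor tokenizes the text and preprocesses the image in one call.
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Simplest possible supervision: predict every token of the sequence.
# In practice you would set the prompt tokens and any padding to -100.
labels = inputs["input_ids"].clone()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
outputs = model(**inputs, labels=labels)  # forward pass returns a causal-LM loss
outputs.loss.backward()
optimizer.step()
```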
Hi! You can try AnyModal: GitHub - ritabratamaiti/AnyModal: AnyModal is a Flexible Multimodal Language Model Framework
It’s a project I’ve been working on that makes it easy to build multimodal LLMs.