Hi,
I have tutorial notebooks for fine-tuning multimodal models on image->JSON use cases (on the CORD dataset), which might be helpful. I created the same notebook for several different models:
- PaliGemma: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/paligemma
- LLaVa: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/llava
- Idefics2: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Idefics2
The same approach should work for LLMs.
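The common trick in those notebooks is to serialize each ground-truth JSON annotation into a flat target string that the model learns to generate. A minimal sketch of that serialization step (the `json2token` helper and tag format here are illustrative, not the exact tutorial code):

```python
import json

def json2token(obj):
    # Flatten a (possibly nested) annotation into a tagged sequence,
    # e.g. {"menu": {"nm": "Latte"}} -> "<s_menu><s_nm>Latte</s_nm></s_menu>".
    # The model is then trained to generate this string from the image.
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        # Separate repeated items (e.g. multiple line items on a receipt)
        return "<sep/>".join(json2token(v) for v in obj)
    return str(obj)

annotation = {"menu": {"nm": "Latte", "price": "4.00"}}
target = json2token(annotation)
print(target)
# -> <s_menu><s_nm>Latte</s_nm><s_price>4.00</s_price></s_menu>
```

At inference time you parse the generated string back into JSON by inverting the tags. Any new tags (like `<s_menu>`) are typically added to the tokenizer as special tokens so they are not split into subwords.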