I have been learning about VLMs and tried to build a small one using "HuggingFaceTB/SmolLM2-135M" and "google/siglip-base-patch16-224" with their pre-trained weights. But the loss is stuck between 0.4 and 0.5, and the model generates captions that look like the training dataset but have nothing to do with the input image. It looks like the model is completely ignoring the image embeddings.
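For context, the overall wiring is the usual projector-style setup: SigLIP produces patch embeddings, a linear projector maps them into the LM's embedding space, and they are prepended to the text embeddings before the causal LM forward pass. This is a simplified sketch of that idea (class and variable names here are illustrative, not my exact notebook code):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained backbones
        self.vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
        self.lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
        # Linear projector: SigLIP hidden size (768) -> SmolLM2 hidden size
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids, labels=None):
        # SigLIP patch embeddings: (B, 196, 768) for 224x224 images with 16x16 patches
        img_feats = self.vision(pixel_values=pixel_values).last_hidden_state
        img_embeds = self.projector(img_feats)                   # (B, 196, d_lm)
        txt_embeds = self.lm.get_input_embeddings()(input_ids)   # (B, T, d_lm)
        inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
        if labels is not None:
            # Mask the image positions so they don't contribute to the loss
            ignore = torch.full(img_embeds.shape[:2], -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([ignore, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)
```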
Google colab notebook: click here
I am just learning, so this might be a silly mistake or I might have misunderstood something. I tried many things and asked multiple LLMs for an explanation, but nothing helped or gave a proper answer. Please point out what's wrong here.
I also tried training for 500-600 steps, but the results were the same as a single epoch, which is around 250 steps.