I have been learning about VLMs and tried to build a small one using "HuggingFaceTB/SmolLM2-135M" and "google/siglip-base-patch16-224" with their pre-trained weights. But the loss is stuck between 0.4 and 0.5, and the model generates captions that look like the training dataset but have nothing to do with the input image. It looks like the model is completely ignoring the image embeddings.
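For context, the overall wiring is the usual projector-style setup: SigLIP produces patch embeddings, a linear projector maps them into the LM's embedding space, and they are prepended to the text embeddings before the causal LM forward pass. This is a simplified sketch of that idea (class and variable names here are illustrative, not my exact notebook code):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained backbones
        self.vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
        self.lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
        # Linear projector: SigLIP hidden size (768) -> SmolLM2 hidden size
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids, labels=None):
        # SigLIP patch embeddings: (B, 196, 768) for 224x224 images with 16x16 patches
        img_feats = self.vision(pixel_values=pixel_values).last_hidden_state
        img_embeds = self.projector(img_feats)                   # (B, 196, d_lm)
        txt_embeds = self.lm.get_input_embeddings()(input_ids)   # (B, T, d_lm)
        inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
        if labels is not None:
            # Mask the image positions so they don't contribute to the loss
            ignore = torch.full(img_embeds.shape[:2], -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([ignore, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)
```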
Google colab notebook: click here
I am just learning, so this might be a silly mistake or I might have misunderstood something. I tried many things and asked multiple LLMs for an explanation, but nothing helped or gave a proper answer. Please point out what's wrong here.
I also tried training for 500-600 steps, but the results were the same as a single epoch, which is around 250 steps.