Fine-tuning the vision-to-language projection adapter for a VLM (GeoChat) when adapting to a new captioning domain

Hello,

I’m working with the MBZUAI/GeoChat-7B model on Hugging Face. My images are the same type of satellite imagery that GeoChat was pretrained on, but I need the captions to follow my own domain style.

Right now I have:

  • Frozen the CLIP vision tower
  • Frozen the projector (the small MLP that maps CLIP embeddings into the LLM’s space)
  • Added LoRA adapters on q_proj and v_proj in the LLM

However, my generated captions come out as gibberish, even though they draw on the vocabulary I expect. I’ve read that unfreezing and fine-tuning the projector MLP is important, because it “translates” visual features into the embedding space the LLM expects, which is what lets it produce domain-specific text.
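
For reference, this is roughly how I set things up (simplified sketch; the "vision_tower" / "mm_projector" names follow the LLaVA-style layout GeoChat is built on, so they may not match the checkpoint exactly, and model is assumed to be the already-loaded checkpoint):

from peft import LoraConfig, get_peft_model

# Freeze the vision tower and the projector by parameter name
for name, param in model.named_parameters():
    if "vision_tower" in name or "mm_projector" in name:
        param.requires_grad = False

# LoRA only on the LLM attention projections
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)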

Questions:

  • Is it sufficient to simply add the projector to my LoRA target modules (e.g. target_modules=['q_proj', 'v_proj', 'mm_projector'])?
  • Are there recommended hyperparameters or training strategies (learning rate, weight decay, scheduler) specifically for tuning the projector MLP?

Any pointers, code snippets, or links to Hugging Face discussion threads or blog posts would be greatly appreciated. Thanks in advance!


Hmm… The way PEFT narrows down which parameters get trained can be quite involved. (Which also means very fine-grained control is possible…)

To adapt GeoChat to your own domain, you must unfreeze and fine-tune the projector MLP. It’s the layer that maps CLIP’s visual features into the LLM’s input embedding space, so keeping it frozen blocks domain-specific adaptation.
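
For intuition: in LLaVA-1.5-style models (which GeoChat builds on), the projector is just a small two-layer MLP, roughly like the sketch below (dimensions assumed for a CLIP ViT-L vision tower feeding a 7B LLM):

import torch.nn as nn

# Rough shape of an mlp2x_gelu-style projector: CLIP feature dim -> LLM hidden dim
mm_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

Because it is tiny compared to the LLM, training it fully is cheap; leaving it frozen means the LLM keeps receiving visual embeddings aligned to the original pretraining caption style.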

Solution Steps:

1. Add the projection MLP module (e.g., mm_projector, projector.mlp, etc.) to your LoRA target_modules.
Example:

target_modules=["q_proj", "v_proj", "mm_projector"]

2. Verify the exact module name via:

for name, _ in model.named_modules():
    print(name)
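
If the full list is too long, a filtered scan (the "projector" substring is an assumption about the naming) also tells you whether the projector is a single Linear or a Sequential of Linears, which matters for how you target it:

for name, module in model.named_modules():
    if "projector" in name:
        print(name, type(module).__name__)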

3. Set your PEFT config to allow training:

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                 # typical starting values; tune for your data and budget
    lora_alpha=32,
    target_modules=target_modules,
    task_type="CAUSAL_LM",
)
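
Then wrap the model and sanity-check what is actually trainable (a minimal sketch, assuming the GeoChat checkpoint is already loaded as model):

from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # projector / LoRA params should show up as trainable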

4. Unfreeze the projector manually if needed:

# Flip requires_grad by parameter name ("mm_projector" is the LLaVA-style name; adjust if yours differs)
for name, param in model.named_parameters():
    if "mm_projector" in name:
        param.requires_grad = True
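
One caveat: if the projector is an nn.Sequential of several Linear layers (as in LLaVA-1.5-style checkpoints), listing its top-level name in target_modules may not behave as expected, since LoRA wraps individual Linear layers. In that case, PEFT's modules_to_save is a cleaner way to train the whole projector at full rank alongside the LoRA adapters, and it gets saved with the adapter checkpoint (sketch below; the "mm_projector" name is again an assumption):

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["mm_projector"],  # exact name depends on the checkpoint; check named_modules()
    task_type="CAUSAL_LM",
)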

Tip from Triskel Data Deterministic AI:
Projection layers are domain-specific translators. If you don’t adapt them, your LLM hears the wrong language from the image tower.
