Fine-tuning the vision-to-language projection adapter for a VLM (GeoChat) when adapting to a new captioning domain

Hello,

I’m working with the MBZUAI/GeoChat-7B model on Hugging Face. My images are the same type of satellite imagery that GeoChat was pretrained on, but I need the captions to follow my own domain style.

Right now I have:

  • Frozen the CLIP vision tower
  • Frozen the projector (the small MLP that maps CLIP embeddings into the LLM’s space)
  • Added LoRA adapters on q_proj and v_proj in the LLM

However, my generated captions come out as gibberish, even though they draw on the vocabulary I expect. I’ve read that unfreezing and fine-tuning the projector MLP is important, because it “translates” visual features into the embedding space the LLM expects, which is what lets it produce domain-specific text.
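
For reference, this is roughly how I set things up (simplified sketch; the "vision_tower" / "mm_projector" names follow the LLaVA-style layout GeoChat is built on, so they may not match the checkpoint exactly, and model is assumed to be the already-loaded checkpoint):

from peft import LoraConfig, get_peft_model

# Freeze the vision tower and the projector by parameter name
for name, param in model.named_parameters():
    if "vision_tower" in name or "mm_projector" in name:
        param.requires_grad = False

# LoRA only on the LLM attention projections
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)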

Questions:

  • Is it sufficient to simply add the projector to my LoRA target modules (e.g. target_modules=['q_proj', 'v_proj', 'mm_projector'])?
  • Are there recommended hyperparameters or training strategies (learning rate, weight decay, scheduler) specifically for tuning the projector MLP?

Any pointers, code snippets, or links to Hugging Face discussion threads or blog posts would be greatly appreciated. Thanks in advance!


Hmm… The way PEFT narrows down which parameters get trained can be quite involved. (Which also means very fine-grained control is possible…)

To adapt GeoChat to your own domain, you must unfreeze and fine-tune the projector MLP. It’s the layer that maps CLIP’s visual features into the LLM’s input embedding space, so keeping it frozen blocks domain-specific adaptation.
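
For intuition: in LLaVA-1.5-style models (which GeoChat builds on), the projector is just a small two-layer MLP, roughly like the sketch below (dimensions assumed for a CLIP ViT-L vision tower feeding a 7B LLM):

import torch.nn as nn

# Rough shape of an mlp2x_gelu-style projector: CLIP feature dim -> LLM hidden dim
mm_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

Because it is tiny compared to the LLM, training it fully is cheap; leaving it frozen means the LLM keeps receiving visual embeddings aligned to the original pretraining caption style.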

Solution Steps:

1. Add the projection MLP module (e.g., mm_projector, projector.mlp, etc.) to your LoRA target_modules.
Example:

target_modules=["q_proj", "v_proj", "mm_projector"]

2. Verify the exact module name via:

for name, _ in model.named_modules():
    print(name)
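
If the full list is too long, a filtered scan (the "projector" substring is an assumption about the naming) also tells you whether the projector is a single Linear or a Sequential of Linears, which matters for how you target it:

for name, module in model.named_modules():
    if "projector" in name:
        print(name, type(module).__name__)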

3. Set your PEFT config to allow training:

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                 # typical starting values; tune for your data and budget
    lora_alpha=32,
    target_modules=target_modules,
    task_type="CAUSAL_LM",
)
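
Then wrap the model and sanity-check what is actually trainable (a minimal sketch, assuming the GeoChat checkpoint is already loaded as model):

from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # projector / LoRA params should show up as trainable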

4. Unfreeze the projector manually if needed:

# Flip requires_grad by parameter name ("mm_projector" is the LLaVA-style name; adjust if yours differs)
for name, param in model.named_parameters():
    if "mm_projector" in name:
        param.requires_grad = True
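
One caveat: if the projector is an nn.Sequential of several Linear layers (as in LLaVA-1.5-style checkpoints), listing its top-level name in target_modules may not behave as expected, since LoRA wraps individual Linear layers. In that case, PEFT's modules_to_save is a cleaner way to train the whole projector at full rank alongside the LoRA adapters, and it gets saved with the adapter checkpoint (sketch below; the "mm_projector" name is again an assumption):

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["mm_projector"],  # exact name depends on the checkpoint; check named_modules()
    task_type="CAUSAL_LM",
)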

Tip from Triskel Data Deterministic AI:
Projection layers are domain-specific translators. If you don’t adapt them, your LLM hears the wrong language from the image tower.
