I am fine-tuning BLIP2 for captioning images from the H&M dataset. I fine-tuned my model with the peft library and used the following code to set it up, train it, and save it:
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration
config = LoraConfig(
    use_rslora=True,
    r=r,                        # default 8
    lora_alpha=lora_alpha,      # default 8
    lora_dropout=lora_dropout,  # default 0
    bias=bias,                  # default "none"
    target_modules=target_modules,
)
checkpoint = "Salesforce/blip2-opt-2.7b"
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)
model = get_peft_model(model, config)
# train model
model.save_pretrained("best_model.pt")
which saves the model's adapter_config.json and adapter_model.safetensors. I load the model via
from peft import PeftModel
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)
peft_model = PeftModel.from_pretrained(model, "best_model.pt")
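As a sanity check that the adapter was actually attached, I can list the injected LoRA modules and the trainable parameter count (a minimal sketch using the standard PEFT/PyTorch APIs; which module names show up depends on my target_modules):
# Sketch: list which submodules received LoRA A matrices.
lora_layers = [name for name, _ in peft_model.named_modules() if name.endswith("lora_A")]
print(f"{len(lora_layers)} modules carry LoRA adapters, e.g. {lora_layers[:3]}")

# PEFT also reports how many parameters are trainable vs. frozen.
peft_model.print_trainable_parameters()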
I want to use the model I trained for extracting word and image embeddings. I did not find a function that does this for the PeftModel class. Therefore I used the pipeline API, which currently looks like this (based on this documentation):
from transformers import AutoProcessor, pipeline
processor = AutoProcessor.from_pretrained(checkpoint)
extractor = pipeline(model=peft_model.base_model.model.vision_model, task="image-feature-extraction", tokenizer=processor.tokenizer, image_processor=processor, device=0)
result = extractor(test_ds[0]["image"], return_tensors=True)
result.shape  # This is a tensor of shape [1, sequence_length, hidden_dimension] representing the input image.
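For the word embeddings I was planning to go through the OPT input embeddings directly, roughly like this (a sketch, assuming peft_model.base_model.model is the wrapped Blip2ForConditionalGeneration and that get_input_embeddings returns the embed_tokens layer shown in the printout further down; the caption string is just an example):
import torch

text_inputs = processor.tokenizer("red cotton t-shirt", return_tensors="pt")
with torch.no_grad():
    # Token (word) embeddings from the OPT language model inside BLIP-2.
    embed_layer = peft_model.base_model.model.language_model.get_input_embeddings()
    word_embeds = embed_layer(text_inputs.input_ids)
print(word_embeds.shape)  # [1, seq_len, 2560]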
My question: Is there a way I can verify that the vision model is fine-tuned? Or is the PeftModel wrapper needed to use the fine-tuned weights?
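For example, would a comparison like the following be a valid check? (A sketch: merge_and_unload is the PEFT call that folds the adapters into the base weights; it modifies the wrapped model in place, so I would only run it on a copy I no longer need for training.)
import torch

base = Blip2ForConditionalGeneration.from_pretrained(checkpoint)
merged = peft_model.merge_and_unload()  # folds the LoRA deltas into the base weights

# If the vision tower was part of target_modules, at least one weight should differ.
changed = any(
    not torch.equal(p_base, p_merged)
    for (_, p_base), (_, p_merged) in zip(
        base.vision_model.named_parameters(), merged.vision_model.named_parameters()
    )
)
print("vision model weights changed:", changed)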
When I print my model, I see lora.Linear layers, but I am not sure what they mean:
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 2560, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 2560)
      (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-31): 32 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): lora.Linear(
              (base_layer): Linear(in_features=2560, out_features=2560, bias=True)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=2560, out_features=32, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=32, out_features=2560, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
            )
            (v_proj): lora.Linear(
              (base_layer): Linear(in_features=2560, out_features=2560, bias=True)
              ...
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
            )
          )
        )
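From what I gather, each lora.Linear keeps the frozen pretrained projection and adds a low-rank update on top, roughly like the following (a generic sketch of the LoRA forward pass using the shapes from the printout above, not the actual peft code; lora_alpha=8 is assumed from the default mentioned earlier):
import torch
from torch import nn

base = nn.Linear(2560, 2560)              # frozen pretrained k_proj/v_proj weight
lora_A = nn.Linear(2560, 32, bias=False)  # trainable down-projection to rank r=32
lora_B = nn.Linear(32, 2560, bias=False)  # trainable up-projection back to 2560
dropout = nn.Dropout(p=0.1)
scaling = 8 / 32 ** 0.5                   # lora_alpha / sqrt(r) because use_rslora=True

x = torch.randn(1, 4, 2560)
y = base(x) + scaling * lora_B(lora_A(dropout(x)))
print(y.shape)  # torch.Size([1, 4, 2560])
What is still unclear to me is whether these adapted weights are active when I hand peft_model.base_model.model.vision_model to the pipeline above.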