Incremental learning for image captioning

Total newbie here. I want to further train the nlpconnect/vit-gpt2-image-captioning model on an India-specific dataset. Following many tutorials, I only ended up fine-tuning the model, which caused it to erase most of its previous knowledge. I found some articles on freezing layers but couldn’t find a workaround for this model. Is there a way to achieve what I want?


I have never done LLM or VLM training, but wouldn’t the technique of transfer learning be useful in this case?

Since you mentioned it “erased most of its previous knowledge”, it sounds like you are using a very high learning rate. Try adjusting it according to the task:

Learning Rate:

  • Small: 1e-6 to 5e-6 (recommended for most tasks)
  • Medium: 1e-5 to 3e-5 (for tasks requiring more adaptation)
  • Large: 5e-5 to 1e-4 (for simple tasks or small datasets)

These are typical ranges; try choosing one according to your context.
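
To make this concrete, here is a minimal sketch of where the learning rate plugs in when fine-tuning with the transformers Trainer; the output directory, epoch count and batch size are placeholder assumptions to adapt to your dataset:

from transformers import Seq2SeqTrainingArguments

# Start with a small learning rate (1e-6 to 5e-6) so the new data nudges the
# pretrained captioning knowledge instead of overwriting it.
training_args = Seq2SeqTrainingArguments(
    output_dir="vit-gpt2-indian-captions",  # hypothetical output path
    learning_rate=5e-6,                     # "small" range from the list above
    num_train_epochs=3,                     # placeholder
    per_device_train_batch_size=8,          # placeholder
    warmup_ratio=0.1,                       # a short warmup also helps stability
    predict_with_generate=True,             # generate captions during evaluation
)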


Hey!

@Pankaj8922 has explained the role of the learning rate well. I would add a part about freezing some layers.

When fine-tuning VLMs you might want to freeze one of the two components (either the text or the image part). Fully fine-tuning both is costly in terms of GPU and risky in terms of performance, as you can’t be sure how much of the previous knowledge will be preserved.

From my experience, there are two ways:

  • You keep the text encoder of the model frozen: often it is the biggest component of the VLM and it already provides good textual embeddings. Hence you’ll fine-tune only the image encoding part on your new dataset. To do so, depending on your framework, you set the component as non-trainable. It could look like:
model.text_encoder.trainable = False
  • You freeze some layers of both components: sometimes your dataset is exotic in both images and prompts (medical, scientific or any other advanced topic). In that case, you might want to fine-tune both components, letting the text part learn the vocabulary and the image part learn the structure of your new images. You can set several transformer blocks to be “untrainable”, just as with the text encoder. The exact way of freezing depends on the model architecture, but it could look like the following (see the PyTorch sketch at the end of this post for your specific model):
layers_to_freeze = 6  # e.g. if both components have 12 attention blocks

for i, layer in enumerate(model.text_encoder):
    if i < layers_to_freeze:
        layer.trainable = False

for i, layer in enumerate(model.image_encoder):
    if i < layers_to_freeze:
        layer.trainable = False

Learning rate can be tricky to find and set, so coupling both approaches can definitely be useful for your use case!
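
Since the model in the original question, nlpconnect/vit-gpt2-image-captioning, is a PyTorch VisionEncoderDecoderModel (a ViT image encoder plus a GPT-2 text decoder), the equivalent of setting trainable = False is setting requires_grad = False on the parameters. A rough sketch, assuming the usual 12 blocks per component and an arbitrary choice of 6 frozen layers:

from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained(
    "nlpconnect/vit-gpt2-image-captioning"
)

# Option A: freeze the whole ViT image encoder and train only the GPT-2 decoder.
# for param in model.encoder.parameters():
#     param.requires_grad = False

# Option B: freeze the first N transformer blocks of each component.
layers_to_freeze = 6  # ViT-base and GPT-2 each have 12 blocks in this checkpoint

for block in model.encoder.encoder.layer[:layers_to_freeze]:   # ViT blocks
    for param in block.parameters():
        param.requires_grad = False

for block in model.decoder.transformer.h[:layers_to_freeze]:   # GPT-2 blocks
    for param in block.parameters():
        param.requires_grad = False

# Sanity check: how many parameters are still trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")

Frozen parameters receive no gradient updates, so you can pass this partially frozen model to the Trainer as usual.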