Total newbie here. I want to further train the “nlpconnect/vit-gpt2-image-captioning” model on an Indian-specific dataset. Following many tutorials, I only ended up fine-tuning the model, which caused it to erase most of its previous knowledge. I found some articles on freezing layers but couldn’t find a workaround for this model. Is there a way to achieve what I want?
I have never done LLM or VLM training, but wouldn’t the technique of transfer learning be useful in this case?
Since you mention it “erased most of its previous knowledge”, it sounds like you are using a very high learning rate. Try adjusting it according to the task:
Learning Rate:
- Small: 1e-6 to 5e-6 (recommended for most tasks)
- Medium: 1e-5 to 3e-5 (for tasks requiring more adaptation)
- Large: 5e-5 to 1e-4 (for simple tasks or small datasets)
These are typical ranges; try choosing one according to your context (the sketch below shows where the value plugs in).
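As a minimal sketch of where that learning rate goes, assuming you fine-tune with the Hugging Face Seq2SeqTrainer: the output directory, epoch count, batch size and the train_dataset / eval_dataset variables below are placeholders for your own setup, not values from this thread.

from transformers import (
    VisionEncoderDecoderModel,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-gpt2-indian-captions",  # placeholder output folder
    learning_rate=5e-6,                     # small LR: adapt gently, keep prior knowledge
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_ratio=0.1,                       # a short warmup also helps stability
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your preprocessed captioning dataset (placeholder)
    eval_dataset=eval_dataset,    # placeholder
)
trainer.train()

Starting at the small end and raising the learning rate only if the model underfits is usually the safer direction when you want to keep the pretrained captioning ability.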
Hey!
@Pankaj8922 has covered the role of the learning rate well. I would add a part on freezing some layers.
When fine-tuning VLMs you might want to freeze one of the two components (either the text or the image part). Fully fine-tuning both is costly in GPU terms and risky for performance, since you may overwrite the model’s previous knowledge (catastrophic forgetting).
In my experience, there are two ways:
- You keep the text_encoder of the model frozen: it is often the biggest component of the VLM and already provides a good textual embedding, so you fine-tune only the image encoding part on your new dataset. To do so, depending on your model, you set the corresponding parameters to be non-trainable (see the concrete sketch after this list for the model in question). It could look like:
for param in model.text_encoder.parameters():  # freeze the whole text component
    param.requires_grad = False
- You freeze some layers of both components: sometimes your dataset is exotic in both images and prompts (medical, scientific or any advanced topic). In that case, you might want to fine-tune both components, letting the text part learn the vocabulary and the image part learn the structure of your new images. You can set several transformer blocks to be non-trainable, just as with the text encoder. The exact way of freezing depends on the model architecture, but it would look like:
layers_to_freeze = 6  # e.g. when both components have 12 transformer blocks

# Freeze the first blocks of the text component
# (.layers stands for whatever list of transformer blocks your architecture exposes)
for i, layer in enumerate(model.text_encoder.layers):
    if i < layers_to_freeze:
        for param in layer.parameters():
            param.requires_grad = False

# Freeze the first blocks of the image component
for i, layer in enumerate(model.image_encoder.layers):
    if i < layers_to_freeze:
        for param in layer.parameters():
            param.requires_grad = False
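To make the two options concrete for “nlpconnect/vit-gpt2-image-captioning”: it is a VisionEncoderDecoderModel where the image part is the ViT encoder (model.encoder) and the text part is the GPT-2 decoder (model.decoder). The sketch below assumes the current transformers layout, where the ViT blocks sit under model.encoder.encoder.layer and the GPT-2 blocks under model.decoder.transformer.h; layers_to_freeze is just an example, and in practice you would pick only one of the two options.

from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Option 1: freeze the whole text side (GPT-2 decoder), fine-tune only the ViT image encoder
for param in model.decoder.parameters():
    param.requires_grad = False

# Option 2: freeze the first N transformer blocks of both components
layers_to_freeze = 6  # both the ViT encoder and the GPT-2 decoder have 12 blocks
for block in model.encoder.encoder.layer[:layers_to_freeze]:  # ViT image blocks
    for param in block.parameters():
        param.requires_grad = False
for block in model.decoder.transformer.h[:layers_to_freeze]:  # GPT-2 text blocks
    for param in block.parameters():
        param.requires_grad = False

# Sanity check: how many parameters will actually be updated
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")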
The learning rate can be tricky to find and set, so coupling both approaches can definitely be useful for your use case!