Hi!
I fine-tuned BLIP-2 (the 2.7B variant) to generate image descriptions for the HM dataset, using its images and article descriptions as captions.
I was curious about the effect of different parameters, so I ran a grid search over the learning rate, the batch size, and the LoRA target layers:
- Learning rate: 1e-5, 5e-5, 1e-4, 5e-4
- Effective batch size: 16, 32 (using gradient accumulation whenever the full batch did not fit into GPU memory)
- LoRA layers: all-linear (LoRA applied to all linear layers) vs. QV (only the query and value layers adapted)

In total this gives 4 × 2 × 2 = 16 combinations.
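Roughly, the grid looks like this (a minimal sketch only; the per-device batch size and the LoRA module names are placeholders, not my exact training script):

```python
from itertools import product

# Sketch of the grid: 4 learning rates x 2 effective batch sizes x 2 LoRA
# target configurations = 16 runs. Values marked "assumed" are placeholders.
learning_rates = [1e-5, 5e-5, 1e-4, 5e-4]
effective_batch_sizes = [16, 32]
lora_targets = {
    "all-linear": "all-linear",      # PEFT shortcut: adapt every linear layer
    "QV": ["q_proj", "v_proj"],      # assumed names of the query/value projections
}
per_device_batch_size = 8            # assumed value that fits into GPU memory

for lr, eff_bs, (name, targets) in product(
    learning_rates, effective_batch_sizes, lora_targets.items()
):
    # Gradient accumulation keeps the effective batch size fixed when the
    # full batch does not fit on the GPU.
    grad_accum_steps = max(1, eff_bs // per_device_batch_size)
    print(
        f"run: lr={lr}, effective_bs={eff_bs}, "
        f"accum_steps={grad_accum_steps}, lora={name} -> {targets}"
    )
```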
I used Weights and Biases to track my experiments and got the following plots. I am sharing this because I would like to get a sanity check to see if I interpreted the results correctly.
There are some obvious outliers in the training runs; if you are interested, let me know and I can share those settings in detail. Otherwise, my general impression is that:
- most settings work well, given that both the training and validation loss decrease.
- targeting only the QV layers does not change the validation loss drastically, but it roughly halves the runtime, from about 4 h to 2 h per epoch (20,847 batches per epoch); see the sketch below for comparing the trainable parameters of the two configurations.
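One way to sanity-check that runtime difference is to look at how many parameters each LoRA configuration actually trains. This is a minimal sketch assuming the Salesforce/blip2-opt-2.7b checkpoint and a recent PEFT version that supports `target_modules="all-linear"`; the rank, alpha, dropout, and module names are placeholders, not my exact settings:

```python
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Compare trainable parameter counts for the two LoRA configurations.
# Assumptions: Salesforce/blip2-opt-2.7b checkpoint, PEFT with the
# "all-linear" shortcut, placeholder rank/alpha/dropout values.
configs = {
    "all-linear": "all-linear",
    "QV": ["q_proj", "v_proj"],   # assumed names of the query/value projections
}

for name, targets in configs.items():
    base = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
    lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=targets)
    peft_model = get_peft_model(base, lora_cfg)
    print(name)
    peft_model.print_trainable_parameters()
```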
I'd be happy to hear some opinions; if you have any questions, let me know.