BLIP-2 - Should the image + language model be frozen by default?

I was reading through the BLIP-2 paper and saw that the image encoder and language model are kept frozen during training.

In the Hugging Face implementation, the vision and language models are initialized without freezing (unless I’m missing something in the implementation). I think these should be frozen by default, since that is the training approach used in the paper; otherwise the model ends up training far more parameters than expected and doesn't get the intended bottlenecking through the Q-Former.
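
For reference, here is a minimal sketch of how the two towers can be frozen manually with the current Hugging Face classes (assuming the `vision_model` / `language_model` attribute names of `Blip2ForConditionalGeneration` and the `Salesforce/blip2-opt-2.7b` checkpoint; names may differ across versions):

```python
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Freeze the ViT image encoder and the language model so that only the
# Q-Former (and the projection layer) receives gradients, as in the paper.
for param in model.vision_model.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False
```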

In the upstream implementation by Salesforce, both the ViT and the language model are frozen.