Fine-tuning image captioning models

Please help me fine-tune models for image captioning. I want to fine-tune CLIP, ViT, and BLIP. If there are other models suited to this task, please point me to them as well.