Image captioning for Spanish with pre-trained vision and text model

So you want to create a CLIP-like (dual-encoder) model? @valhalla
