Good morning,
I want to use CLIP as a binary model, that is, a model that takes a single sentence and a single image as input and returns a yes/no decision.
The existing models don't work for this because the values coming out of the last linear layer are too large, so passing them through a sigmoid tends to give a high probability.
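To make the problem concrete, here is a minimal sketch of what I think is happening. It assumes the final score is a cosine similarity between the image and text embeddings multiplied by a large scale factor; the scale of 100 and the two similarity values are illustrative assumptions, not numbers measured from a real model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Assumed scale applied to the cosine similarity (illustrative value).
LOGIT_SCALE = 100.0

# Hypothetical cosine similarities for a matching and a non-matching
# sentence/image pair (made-up values for illustration only).
cos_match = 0.30
cos_mismatch = 0.15

# Scaled scores and the probabilities a plain sigmoid assigns to them.
logit_match = LOGIT_SCALE * cos_match
logit_mismatch = LOGIT_SCALE * cos_mismatch
prob_match = sigmoid(logit_match)
prob_mismatch = sigmoid(logit_mismatch)

print(f"match:    logit={logit_match:.1f}  sigmoid={prob_match:.6f}")
print(f"mismatch: logit={logit_mismatch:.1f}  sigmoid={prob_mismatch:.6f}")
```

Under these assumptions both pairs end up with a probability very close to 1, so the sigmoid output no longer separates a matching pair from a non-matching one.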
Which model can solve this?