Hi everyone!
I was wondering whether I can train a sentence transformer with a triplet loss (with or without labeled data), then freeze all of its layers and use the frozen model (or its embeddings) to fine-tune a classification head (e.g. a classic fully connected network) on the same data or on a held-out portion of the data.
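Just to make the idea concrete, here is a rough sketch of what I mean (not real training code; the checkpoint name, the triplets, `sentences`, `labels`, and `num_classes` are placeholders I made up):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Step 1: train the sentence transformer with a triplet loss
model = SentenceTransformer("bert-base-uncased")
train_examples = [
    InputExample(texts=["anchor sentence", "positive sentence", "negative sentence"]),
    # ... more (anchor, positive, negative) triplets
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)

# Step 2: freeze the encoder and train only a classification head on its embeddings
for p in model.parameters():
    p.requires_grad = False

num_classes = 3  # placeholder
head = nn.Sequential(
    nn.Linear(model.get_sentence_embedding_dimension(), 256),
    nn.ReLU(),
    nn.Linear(256, num_classes),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

sentences = ["a labeled sentence", "another labeled sentence"]  # placeholder
labels = torch.tensor([0, 1])                                    # placeholder

# the frozen encoder only produces features; gradients flow into the head
embeddings = model.encode(sentences, convert_to_tensor=True).cpu()
logits = head(embeddings)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```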
Alternatively, can I train a classic transformer (a BERT cross-encoder) with a classification head using a standard classification loss such as cross entropy, but instead of feeding the CLS token embedding into the head, feed it an embedding obtained by max pooling or average pooling over all token embeddings from the transformer's last layer?
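Again, a rough sketch of this second setup, just so it's clear what I mean by pooling over the last layer (the model name, `num_classes`, and the example batch are placeholders):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PooledClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_classes=2, pooling="mean"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.pooling = pooling
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                   # (batch, tokens, hidden)
        mask = attention_mask.unsqueeze(-1).float()      # (batch, tokens, 1)
        if self.pooling == "mean":
            # average over real tokens only, ignoring padding
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        else:
            # max pooling: push padded positions to -inf-ish so they never win
            pooled = hidden.masked_fill(mask == 0, -1e9).max(dim=1).values
        return self.head(pooled)                         # logits for cross entropy

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PooledClassifier(pooling="mean")
batch = tokenizer(["an example sentence"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))  # placeholder label
```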
I ask because I am curious whether the feature space of a sentence transformer and that of a classic transformer are different but both still useful for the classification task.
Thank you!