Transformer vs Sentence-Transformer for text classification

Hi everyone!

I was wondering whether I can train a sentence transformer with a triplet loss (with and without labeled data) and then, freezing all of its layers, use this model (or its embeddings) to fine-tune a classification head (e.g. a classic fully connected network) on the same data or on a held-out portion of the data.
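
Roughly what I have in mind for this first option, as a minimal sketch (assuming the sentence-transformers and PyTorch libraries; the model name, triplet examples and labels below are just placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# 1) Train the sentence transformer with a triplet loss (placeholder triplets)
model = SentenceTransformer("all-MiniLM-L6-v2")
triplets = [InputExample(texts=["anchor text", "positive text", "negative text"])]
loader = DataLoader(triplets, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.TripletLoss(model=model))], epochs=1)

# 2) Freeze it: encode the classification data without gradients
sentences = ["a sentence to classify", "another sentence"]
labels = torch.tensor([0, 1])
with torch.no_grad():
    embeddings = model.encode(sentences, convert_to_tensor=True)

# 3) Train only the classification head on the frozen embeddings
num_classes = 2
head = nn.Sequential(
    nn.Linear(embeddings.size(1), 128),
    nn.ReLU(),
    nn.Linear(128, num_classes),
)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss = nn.CrossEntropyLoss()(head(embeddings), labels)
loss.backward()
optimizer.step()
```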

Alternatively, can I train a classic transformer (a BERT cross-encoder) with a classification head using a standard classification loss such as cross-entropy, but instead of feeding the CLS token embedding into the classification head, feed it an embedding vector built by max pooling or average pooling over all token embeddings from the last layer of the transformer?
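
And a sketch of this second option, assuming the transformers library (the model name, pooling choice and toy batch are again placeholders, not a definitive implementation):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PooledBertClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_classes=2, pooling="mean"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.pooling = pooling
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # Last-layer token embeddings: (batch, seq_len, hidden)
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        if self.pooling == "mean":
            # Average pooling over real (non-padding) tokens
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        else:
            # Max pooling; padding positions pushed very low so they never win
            pooled = hidden.masked_fill(mask == 0, -1e9).max(dim=1).values
        # Pooled sentence vector goes to the head instead of the CLS embedding
        return self.head(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["an example sentence", "another one"],
                  padding=True, return_tensors="pt")
model = PooledBertClassifier()
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))
```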

I ask because I am curious to find out whether the feature spaces of a sentence transformer and a classic transformer are different but still useful for the classification task.

Thank you!
