Can we train a Sentence Transformer model for sequence classification?

Can we use a fine-tuned Sentence Transformer model as the starting point for AutoModelForSequenceClassification()?

    • I have a fine-tuned Sentence Transformer model, which also includes a pooling layer.
    • Now I take this model and fine-tune it again with AutoModelForSequenceClassification() for binary classification.

I only get this warning:

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sent_tranf_model/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("sent_tranf_model/")
NUM_LABELS = len(idx_to_label)  # idx_to_label is defined elsewhere
model = AutoModelForSequenceClassification.from_pretrained("sent_tranf_model/", num_labels=NUM_LABELS)

Is it okay to do this?


I don’t believe you will get consistent and meaningful classification results with the randomly initialized parameters in the classifier layer, as the warning message suggests. You can fine-tune the model if you have labeled data, but I wouldn’t expect a performance gain over BERT/RoBERTa fine-tuning. The advantage of using sentence transformers becomes apparent in the unsupervised setting.

If you want to solve a classification task with sentence transformer models, you can exploit similarity metrics between the embedding of the text to be classified and a representation embedding of each class. The challenge is finding those representation embeddings. Conventional unsupervised methods like clustering can be useful, but you would need to map (maybe manually) the generated clusters to the classes you have. Using a more modern approach, you may describe the properties of each class with a prompt, get its embedding, and treat that as the representation embedding.
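As a minimal sketch of this similarity-based approach (the checkpoint path and the class prompts below are hypothetical examples, not from the original post), one could embed a short description of each class and assign each text to the class whose embedding is most cosine-similar:

```python
import numpy as np

def classify_by_similarity(text_emb: np.ndarray, class_embs: np.ndarray):
    """Return (index, similarities) of the class embedding most
    cosine-similar to the text embedding."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = class_embs @ text_emb  # cosine similarity with each class
    return int(np.argmax(sims)), sims

# With a Sentence Transformers model (checkpoint and prompts are illustrative):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("sent_tranf_model/")
# class_prompts = ["This text expresses a positive opinion.",
#                  "This text expresses a negative opinion."]
# label, _ = classify_by_similarity(model.encode("Loved it!"),
#                                   model.encode(class_prompts))
```

No labeled data is needed here; the quality of the result depends heavily on how well the class prompts describe the classes.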

TL;DR: if you directly use the model for inference after loading it this way, it will not generate meaningful classifications. You can fine-tune it after initializing it like this, but I wouldn’t expect a performance gain compared to BERT/RoBERTa fine-tuning. If you don’t have labeled data, you can use unsupervised learning to match sentence embeddings with class embeddings.

What if I fine-tune this for 10–15 epochs on labelled data, so the classifier weights become meaningful? Won’t it work in that case?

Sure, if you fine-tune the model with your data, it will work for your task. As I said, I don’t see a substantial advantage over fine-tuning other models such as BERT/RoBERTa, but it could provide one depending on the specifics of the downstream task. The problem is the random initialization of the classifier layer, and fine-tuning solves that.

As always, the performance of the fine-tuned model depends on the quality of the data and hyperparameters.
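To see why fine-tuning fixes the randomly initialized head, here is a toy stand-in (synthetic 768-dim "sentence embeddings" and a fresh linear layer mirroring the newly initialized classifier.weight/classifier.bias from the warning): the head starts out random, but the loss drops once it is trained on labelled data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for pooled sentence embeddings with learnable structure:
# the binary label is derived from the sign of the first embedding dimension.
embeddings = torch.randn(200, 768)
labels = (embeddings[:, 0] > 0).long()

head = nn.Linear(768, 2)  # analogous to the newly initialized classifier head
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

first_loss = None
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    if first_loss is None:
        first_loss = loss.item()  # loss with the random head, before training
    loss.backward()
    opt.step()

final_loss = loss.item()
```

This is only an illustration of the mechanism; in practice you would fine-tune the whole AutoModelForSequenceClassification (e.g. with the Trainer API), not just the head.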

One advantage would be reduced model size and faster training. Agree?

Well, I don’t know which specific checkpoint you are referring to, but these two models have exactly the same number of parameters.

So there isn’t a model size reduction. Arguably, though, the Sentence-BERT model has been trained further after the standard pretraining process, which might result in better use of its general knowledge.
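One way to check the size claim yourself is simply to count parameters. A quick sketch (the Sentence-BERT checkpoint name below is one illustrative example built on bert-base, not necessarily the one discussed here):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters())

# Comparing a base BERT with a Sentence-BERT checkpoint built on it
# (requires downloading both checkpoints; names are illustrative):
# from transformers import AutoModel
# base = AutoModel.from_pretrained("bert-base-uncased")
# sbert = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
# print(count_parameters(base), count_parameters(sbert))
```

Since Sentence-BERT fine-tunes the same transformer backbone rather than shrinking it, the two counts come out the same.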