I have a multi-label dataset with two columns, `sentence` and `label`. I want to train an embedding model for similarity comparison.
Currently, I am training with the following steps:
- loading the original dataset.
- pairing the sentences, assigning similarity 1 to pairs with the same label and 0 to others.
- constructing a new dataset with three columns: `sentence1`, `sentence2`, and `score`.
- finally, training the model on the new dataset (a rough sketch of this pipeline follows the list).
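
For concreteness, here is a minimal sketch of the pairing pipeline described above, assuming sentence-transformers with `CosineSimilarityLoss`; the model name, the toy data, and the variable names are only placeholders:

```python
from itertools import combinations

from datasets import Dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 1. Load the original dataset with `sentence` and `label` columns (toy data here).
raw = Dataset.from_dict({
    "sentence": ["a cat sits", "a dog runs", "a kitten sleeps"],
    "label":    ["animal_cat", "animal_dog", "animal_cat"],
})

# 2./3. Pair every sentence with every other one; score 1 if the labels match,
#       0 otherwise. This is the quadratic, disk-hungry step I want to avoid.
pairs = [
    InputExample(texts=[raw[i]["sentence"], raw[j]["sentence"]],
                 label=float(raw[i]["label"] == raw[j]["label"]))
    for i, j in combinations(range(len(raw)), 2)
]

# 4. Train on the (sentence1, sentence2, score) pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(pairs, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```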
I find the pairing process time-consuming, and storing the paired dataset would take a lot of disk space because many sentences are duplicated across pairs. Is there a way to avoid the explicit pairing step while keeping disk usage low?