I have a multi-label dataset with two columns, `sentence` and `label`. I want to train an embedding model for similarity comparison.
Currently, I am training with the following steps:
- loading the original dataset.
- pairing the sentences, assigning similarity 1 to pairs with the same label and 0 to others.
- constructing a new dataset with three columns: `sentence1`, `sentence2`, and `score`.
- finally, training the model on the new dataset (a rough sketch of this pipeline follows the list).
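
For concreteness, here is a minimal sketch of the pairing pipeline described above, assuming sentence-transformers with `CosineSimilarityLoss`; the model name, the toy data, and the variable names are only placeholders:

```python
from itertools import combinations

from datasets import Dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 1. Load the original dataset with `sentence` and `label` columns (toy data here).
raw = Dataset.from_dict({
    "sentence": ["a cat sits", "a dog runs", "a kitten sleeps"],
    "label":    ["animal_cat", "animal_dog", "animal_cat"],
})

# 2./3. Pair every sentence with every other one; score 1 if the labels match,
#       0 otherwise. This is the quadratic, disk-hungry step I want to avoid.
pairs = [
    InputExample(texts=[raw[i]["sentence"], raw[j]["sentence"]],
                 label=float(raw[i]["label"] == raw[j]["label"]))
    for i, j in combinations(range(len(raw)), 2)
]

# 4. Train on the (sentence1, sentence2, score) pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(pairs, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```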
I find the pairing process time-consuming, and storing the paired dataset would take a lot of disk space because many sentences are duplicated across pairs. Is there a way to avoid the explicit pairing step while keeping disk usage low?