I'm using the multilingual e5 base model to generate document embeddings for a similarity task. Because I have very little labeled data, I tried unsupervised pretraining with TSDAE. Unfortunately, the pretrained model performs worse than the plain e5 model without any fine-tuning at all. Is there something flawed in my approach?
# Imports for the snippet below (df, pre_epochs, pre_lr and
# pretrain_save_path are defined earlier in my script)
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets

pre_model = 'intfloat/multilingual-e5-base'

# Fresh SentenceTransformer with CLS pooling, as in the sentence-transformers TSDAE example
word_embedding_model = models.Transformer(pre_model)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# DenoisingAutoEncoderDataset adds deletion noise to each sentence on the fly
train_sentences = df['Text'].tolist()
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

for epoch in range(pre_epochs):
    # My logging wrapper around losses.DenoisingAutoEncoderLoss (definition omitted);
    # note it is instantiated, decoder included, on every pass through this loop
    train_loss = LoggingDenoisingAutoEncoderLoss(
        model, decoder_name_or_path=pre_model, tie_encoder_decoder=False
    )
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=1,
        weight_decay=0,
        scheduler='constantlr',
        optimizer_params={'lr': pre_lr},
        show_progress_bar=True,
    )
    model.save(pretrain_save_path)
Or is it expected that pretraining like this replaces all of the model's existing weights?
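For reference, this is roughly how I compare the two models. It's a minimal sketch, assuming a small held-out set of text pairs; the pair list and variable names are placeholders, and the 'query: ' prefix is what the e5 model card recommends for symmetric similarity tasks:

from sentence_transformers import SentenceTransformer, util

baseline = SentenceTransformer('intfloat/multilingual-e5-base')
tsdae_model = SentenceTransformer(pretrain_save_path)  # checkpoint saved above

# Placeholder evaluation pairs
pairs = [('first document ...', 'second document ...')]

for a, b in pairs:
    # e5 model card: prefix inputs with 'query: ' for symmetric similarity tasks
    texts = ['query: ' + a, 'query: ' + b]
    e_base = baseline.encode(texts, normalize_embeddings=True)
    e_tsdae = tsdae_model.encode(texts, normalize_embeddings=True)
    print('baseline:', util.cos_sim(e_base[0], e_base[1]).item(),
          'tsdae:', util.cos_sim(e_tsdae[0], e_tsdae[1]).item())

It's on these cosine-similarity scores that the TSDAE checkpoint consistently looks worse than the untouched baseline.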