Fine-tuning sentence-transformer for retrieval task makes things worse

Hi,

This is the first time I’m trying to fine-tune a model so I hope someone can help me.
I’m trying to build a semantic search with Weaviate and the following sentence-transformer model: sentence-transformers/multi-qa-mpnet-base-dot-v1

This works relatively well, but since the content is domain-specific (and in German), I wanted to fine-tune the model using GPL, as described on the "Domain Adaptation" page of the Sentence-Transformers documentation.

I generated around 40k questions using ChatGPT rather than a T5 model, because this seemed to work better and I didn’t want to have to fine-tune a T5.
Then I did the negative mining using this model: ml6team/cross-encoder-mmarco-german-distilbert-base
I did not fine-tune this model. Maybe that’s the issue?
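
For context, in GPL the margin of a triplet is usually the cross-encoder score of (question, positive) minus its score of (question, negative). Below is a minimal sketch of that step, not necessarily the exact code used for this data. It assumes a scorer that takes a list of (query, passage) pairs and returns one score per pair. `gpl_margins` and `toy_scorer` are hypothetical names for illustration, and the word-overlap scorer is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type alias: a scorer maps (query, passage) pairs to relevance scores.
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def gpl_margins(score_pairs: PairScorer,
                triplets: Sequence[Tuple[str, str, str]]) -> List[float]:
    """Return score(query, positive) - score(query, negative) for each triplet."""
    pos_pairs = [(query, pos) for query, pos, _ in triplets]
    neg_pairs = [(query, neg) for query, _, neg in triplets]
    pos_scores = score_pairs(pos_pairs)
    neg_scores = score_pairs(neg_pairs)
    return [float(p) - float(n) for p, n in zip(pos_scores, neg_scores)]


def toy_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (assumption): number of words shared by query and passage."""
    return [
        float(len(set(query.lower().split()) & set(passage.lower().split())))
        for query, passage in pairs
    ]


triplets = [("wie hoch ist die miete", "die miete ist hoch", "das wetter ist schön")]
margins = gpl_margins(toy_scorer, triplets)
print(margins)
```

With the real model, `CrossEncoder("ml6team/cross-encoder-mmarco-german-distilbert-base").predict` from sentence_transformers could presumably be passed as `score_pairs`, since it also takes a list of pairs and returns one score per pair.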

After that I had my training data, each sample of which consisted of a question, a positive text, a negative text and a margin.
The questions are between 10 and 30 words long; the texts are between 50 and 300 words long.
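
To make the role of the margin explicit, here is a small sketch of the objective that MarginMSELoss optimizes, as far as I understand it: the squared difference between the bi-encoder's own margin, dot(q, pos) - dot(q, neg), and the target margin from the data. The function names and the tiny hand-made vectors are assumptions for illustration only; the real loss runs on model embeddings in batches.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def margin_mse(queries: Sequence[Sequence[float]],
               positives: Sequence[Sequence[float]],
               negatives: Sequence[Sequence[float]],
               target_margins: Sequence[float]) -> float:
    """Mean squared error between predicted and target margins."""
    errors: List[float] = []
    for q, pos, neg, target in zip(queries, positives, negatives, target_margins):
        predicted = dot(q, pos) - dot(q, neg)  # bi-encoder margin
        errors.append((predicted - target) ** 2)
    return sum(errors) / len(errors)


# Toy embeddings (made up): the first triplet is off by 0.5, the second is exact.
loss = margin_mse(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    positives=[[2.0, 0.0], [0.0, 3.0]],
    negatives=[[0.5, 0.0], [0.0, 1.0]],
    target_margins=[1.0, 2.0],
)
print(loss)
```

One thing this makes visible is that the target margins live on the scale of the cross-encoder scores, while the predicted margins live on the scale of the bi-encoder's dot products.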

This is the code I used (with a small batch size to accommodate limited memory):

import random
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers import evaluation


def eval_callback(score, epoch, steps):
    print(epoch, steps, score)

df = pd.read_csv("question_context_neg_triplets.csv")


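# Training examples: (question, positive, negative) with the margin as label
train_examples = []
# Held-out triplets for the TripletEvaluator
anchors = []
positives = []
negatives = []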


for _, row in df.iterrows():
    frage = row["frage"]
    positiv = row["positive"]
    negativ = row["negativ"]
    
    if random.random() < 0.05:  # 5% of the data we'll use for eval
        anchors.append(frage)
        positives.append(positiv)
        negatives.append(negativ)
    else:
        # Cast to a plain Python float so the batched labels become a float32 tensor
        train_examples.append(InputExample(texts=[frage, positiv, negativ], label=float(row["margin"])))


print(len(train_examples), "for training")
print(len(anchors), "for evaluation")

evaluator = evaluation.TripletEvaluator(anchors, positives, negatives, main_distance_function=3) # 3 == dot product

# Update the bi-encoder model with the new triplets (query, positive, negative)
# using MarginMSELoss.
print("Training Bi-Encoder")

#max_seq_length = 512
model_name = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
model_save_path = "/content/drive/MyDrive/multi-qa-mpnet-base-dot-v1-gpld"

model = SentenceTransformer(model_name)
#model.max_seq_length = max_seq_length
train_dataloader = DataLoader(train_examples, batch_size=6, drop_last=True, shuffle=True)
train_loss = losses.MarginMSELoss(model)

#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=int(len(train_dataloader)*0.1), 
          evaluator=evaluator, 
          evaluation_steps=500,
          callback=eval_callback,
          output_path=model_save_path)

I re-ingested all data into Weaviate, making sure that the new model was used to embed it.
After fine-tuning, recall goes down A LOT.
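
To put a number on "a lot", recall@k could be computed the same way for the base model and the fine-tuned model on the same held-out questions. This is a minimal sketch that assumes the search results have already been collected as ranked lists of document IDs per question. `recall_at_k` and the example IDs are hypothetical, and the Weaviate query itself is left out.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Average share of each query's relevant documents found in its top-k results."""
    if not relevant:
        raise ValueError("need at least one query with relevant documents")
    total = 0.0
    for query_id, relevant_ids in relevant.items():
        top_k = set(retrieved.get(query_id, [])[:k])
        total += len(top_k & relevant_ids) / len(relevant_ids)
    return total / len(relevant)


# Made-up example: one relevant passage per question, as in the triplet data.
relevant = {"q1": {"d1"}, "q2": {"d5"}}
retrieved = {"q1": ["d3", "d1", "d2"], "q2": ["d4", "d6", "d5"]}

recall_2 = recall_at_k(retrieved, relevant, k=2)
recall_3 = recall_at_k(retrieved, relevant, k=3)
print(recall_2, recall_3)
```

With exactly one positive passage per question, this equals the share of questions whose positive passage appears in the top k results.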

Any help would be very much appreciated!