Hi,
This is the first time I’m trying to fine-tune a model so I hope someone can help me.
I’m trying to build a semantic search application using Weaviate and the following sentence-transformers model: sentence-transformers/multi-qa-mpnet-base-dot-v1
This works relatively well, but since the content is domain-specific (and in German), I wanted to fine-tune the model using GPL, as described in the Domain Adaptation — Sentence-Transformers documentation.
I generated around 40k questions using ChatGPT instead of a T5 model, because this seemed to work better and I didn’t want to have to fine-tune a T5.
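For context, the generation step looked roughly like this (a minimal sketch using the OpenAI client; the model name, prompt wording and helper name are illustrative, not my exact setup):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_questions(passage: str, n: int = 3) -> list[str]:
    # Ask the chat model for n short German questions that the passage answers
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative choice of chat model
        messages=[{
            "role": "user",
            "content": f"Write {n} short German questions that the following "
                       f"text answers, one per line.\n\n{passage}",
        }],
    )
    return response.choices[0].message.content.strip().split("\n")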
Then I did negative mining using this model: ml6team/cross-encoder-mmarco-german-distilbert-base
I did not fine-tune this model, maybe that’s the issue?
After that I had my training data; each sample consisted of a question, a positive text, a negative text, and a margin.
The questions are anywhere between 10 and 30 words long. The texts are anywhere between 50 and 300 words long.
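For what it’s worth, the margins were computed as the cross-encoder score difference between the positive and the negative text, roughly like this (minimal sketch; the helper name is just for illustration):

from sentence_transformers import CrossEncoder

ce = CrossEncoder("ml6team/cross-encoder-mmarco-german-distilbert-base")

def compute_margin(question: str, positive: str, negative: str) -> float:
    # MarginMSELoss expects label = CE(question, positive) - CE(question, negative)
    pos_score, neg_score = ce.predict([(question, positive), (question, negative)])
    return float(pos_score - neg_score)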
This is the code I used (small batch size to accommodate limited memory):
import random

import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers import evaluation


def eval_callback(score, epoch, steps):
    print(epoch, steps, score)


df = pd.read_csv("question_context_neg_triplets.csv")

# training triplets
train_examples = []
# held-out eval triplets
anchors = []
positives = []
negatives = []

for _, row in df.iterrows():
    frage = row["frage"]
    positiv = row["positive"]
    negativ = row["negativ"]
    if random.random() < 0.05:  # use 5% of the data for eval
        anchors.append(frage)
        positives.append(positiv)
        negatives.append(negativ)
    else:
        # MarginMSELoss expects the cross-encoder margin as the label
        train_examples.append(InputExample(texts=[frage, positiv, negativ], label=row["margin"]))

print(len(train_examples), "for training")
print(len(anchors), "for evaluation")

# Use the SimilarityFunction enum rather than the raw int 3: a plain int does
# not compare equal to the enum, so the dot-product branch would not be selected
evaluator = evaluation.TripletEvaluator(
    anchors,
    positives,
    negatives,
    main_distance_function=evaluation.SimilarityFunction.DOT_PRODUCT,
)

# Update the bi-encoder model with the new triplets (query, positive, negative) using MarginMSELoss
print("Training Bi-Encoder")

# max_seq_length = 512
model_name = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
model_save_path = "/content/drive/MyDrive/multi-qa-mpnet-base-dot-v1-gpld"

model = SentenceTransformer(model_name)
# model.max_seq_length = max_seq_length

train_dataloader = DataLoader(train_examples, batch_size=6, drop_last=True, shuffle=True)
train_loss = losses.MarginMSELoss(model)

# Tune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(len(train_dataloader) * 0.1),
    evaluator=evaluator,
    evaluation_steps=500,
    callback=eval_callback,
    output_path=model_save_path,
)
I re-ingested all the data in Weaviate, ensuring that the new model was used to embed it.
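Roughly, the re-embedding looked like this (a minimal sketch; the texts are placeholders and the Weaviate ingestion itself is omitted):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/content/drive/MyDrive/multi-qa-mpnet-base-dot-v1-gpld")

texts = ["Beispieltext 1", "Beispieltext 2"]  # placeholder document chunks
# multi-qa-mpnet-base-dot-v1 is a dot-product model, so the embeddings stay
# unnormalized and the Weaviate class should use distance "dot", not cosine
vectors = model.encode(texts)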
After fine-tuning, recall goes down A LOT.
Any help would be very much appreciated!