How does one even evaluate a fine-tuned model? I don't want to evaluate it during training, both to keep things modular and because training already takes a while. I have been using a triplet dataset for embeddings that basically has a question, a positive example and a negative example.
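For reference, each record in my JSON file looks roughly like this (the field names match my code below; the texts here are made up just to illustrate the layout):

# A made-up sample record showing the triplet layout.
# Only the field names ("question", "positive_example",
# "negative_example") are the real ones from my dataset.
sample = {
    "question": "What is the capital of France?",
    "positive_example": "Paris is the capital and largest city of France.",
    "negative_example": "Berlin is the capital of Germany.",
}

print(sorted(sample.keys()))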
This is my code for evaluating so far:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer(model_path, device='cuda')
model.eval()

# Load the triplet dataset and keep a 1000-sample dev subset
ds = load_dataset('json', data_files=dataset_path, split='train')
ds = ds.select(range(1000))

dev_evaluator = TripletEvaluator(
    anchors=ds["question"],
    positives=ds["positive_example"],
    negatives=ds["negative_example"],
    batch_size=64,
    show_progress_bar=True,
    main_distance_function=SimilarityFunction.COSINE,
)

print("Beginning evaluation...")
evaluation_score = dev_evaluator(model)
print(evaluation_score["cosine_accuracy"])
However, there are a few problems with the code. First of all, it somehow runs over the data three times, and the second and third passes are quite a bit slower than the first.
Additionally, I don't know if it's just me, but I struggled to even find an example for main_distance_function, or even an example of a simple evaluation like this.
Any tips on what I'm doing wrong, or on whether I can optimize things further with my GPU?