Title: Fine-Tuning a Domain-Specific Text Ranker: Lessons in Class Imbalance
Description: I recently fine-tuned a Cross-Encoder for scientific text ranking, turning a general-purpose reranker (cross-encoder/ms-marco-MiniLM-L6-v2) into a domain-specific tool for information retrieval.
Using the SciDocs dataset, I navigated several critical machine learning challenges:
- The Challenge: The dataset was heavily imbalanced (~81% irrelevant documents), leading my initial models to collapse into predicting “Not Relevant” for everything (Accuracy ~29%, F1 ~0.0).
- The Solution: I implemented a custom WeightedTrainer with a calibrated BCEWithLogitsLoss(pos_weight=5.0) to penalize missed relevant documents 5x more than false positives (see the sketch after this list). This successfully counteracted the majority class bias.
- The Result: The final model stabilized at 91.75% Accuracy and a 0.75 F1 Score. It can now distinguish between direct semantic matches, irrelevant keyword overlap, and tricky “hard negatives” with high confidence.
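To make the weighted-loss idea concrete, here is a minimal sketch of what such a WeightedTrainer could look like. It assumes the cross-encoder is loaded as a sequence-classification model with a single-logit head and 0/1 relevance labels; everything beyond the WeightedTrainer name and pos_weight=5.0 is illustrative, and the project's actual training setup may differ.

```python
import torch
from torch import nn
from transformers import Trainer


class WeightedTrainer(Trainer):
    """Trainer that up-weights the positive (relevant) class via BCEWithLogitsLoss."""

    def __init__(self, *args, pos_weight: float = 5.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.pos_weight = pos_weight

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits.view(-1)  # single-logit cross-encoder head

        # Missed relevant documents cost pos_weight times more than false positives.
        loss_fct = nn.BCEWithLogitsLoss(
            pos_weight=torch.tensor(self.pos_weight, device=logits.device)
        )
        loss = loss_fct(logits, labels.float().view(-1))
        return (loss, outputs) if return_outputs else loss
```

The pos_weight value roughly mirrors the class ratio in the data, which is the usual calibration heuristic for this kind of loss weighting.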
This project highlights the importance of custom loss functions over standard training loops when dealing with real-world, imbalanced data.
Key Tech Stack: PyTorch, Hugging Face Transformers, Sentence-Transformers.
Model on Hugging Face: ahmedfarazsyk/ms-marco-MiniLM-L6-v2-finetuned-scidocs
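For context, here is a short usage sketch showing how the fine-tuned ranker could be called through the sentence-transformers CrossEncoder API; the query and candidate documents below are made-up examples, not from the SciDocs evaluation.

```python
from sentence_transformers import CrossEncoder

# Load the fine-tuned ranker from the Hugging Face Hub.
model = CrossEncoder("ahmedfarazsyk/ms-marco-MiniLM-L6-v2-finetuned-scidocs")

query = "graph neural networks for citation recommendation"
candidates = [
    "A GNN-based approach to citation recommendation in scholarly corpora.",
    "A blog post about sourdough baking.",                 # irrelevant
    "Keyword-matching baseline for general web retrieval.",  # hard negative
]

# Score each (query, document) pair and rank by predicted relevance.
scores = model.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```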