Title: Fine-Tuning a Domain-Specific Text Ranker: Lessons in Class Imbalance
Description: I recently fine-tuned a Cross-Encoder for scientific text ranking, turning a general-purpose reranker (cross-encoder/ms-marco-MiniLM-L6-v2) into a domain-specific tool for information retrieval.
Using the SciDocs dataset, I navigated several critical machine learning challenges:
- The Challenge: The dataset was heavily imbalanced (~81% irrelevant documents), leading my initial models to collapse into predicting “Not Relevant” for everything (Accuracy ~29%, F1 ~0.0).
- The Solution: I implemented a custom WeightedTrainer with a calibrated BCEWithLogitsLoss(pos_weight=5.0) to penalize missed relevant documents 5x more than false positives (see the sketch after this list). This successfully counteracted the majority class bias.
- The Result: The final model stabilized at 91.75% Accuracy and a 0.75 F1 Score. It can now distinguish between direct semantic matches, irrelevant keyword overlap, and tricky “hard negatives” with high confidence.
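To make the weighted-loss idea concrete, here is a minimal sketch of what such a WeightedTrainer could look like. It assumes the cross-encoder is loaded as a sequence-classification model with a single-logit head and 0/1 relevance labels; everything beyond the WeightedTrainer name and pos_weight=5.0 is illustrative, and the project's actual training setup may differ.

```python
import torch
from torch import nn
from transformers import Trainer


class WeightedTrainer(Trainer):
    """Trainer that up-weights the positive (relevant) class via BCEWithLogitsLoss."""

    def __init__(self, *args, pos_weight: float = 5.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.pos_weight = pos_weight

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits.view(-1)  # single-logit cross-encoder head

        # Missed relevant documents cost pos_weight times more than false positives.
        loss_fct = nn.BCEWithLogitsLoss(
            pos_weight=torch.tensor(self.pos_weight, device=logits.device)
        )
        loss = loss_fct(logits, labels.float().view(-1))
        return (loss, outputs) if return_outputs else loss
```

The pos_weight value roughly mirrors the class ratio in the data, which is the usual calibration heuristic for this kind of loss weighting.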
This project highlights the importance of custom loss functions over standard training loops when dealing with real-world, imbalanced data.
Key Tech Stack: PyTorch, Hugging Face Transformers, Sentence-Transformers.
Model on Hugging Face: ahmedfarazsyk/ms-marco-MiniLM-L6-v2-finetuned-scidocs
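For context, here is a short usage sketch showing how the fine-tuned ranker could be called through the sentence-transformers CrossEncoder API; the query and candidate documents below are made-up examples, not from the SciDocs evaluation.

```python
from sentence_transformers import CrossEncoder

# Load the fine-tuned ranker from the Hugging Face Hub.
model = CrossEncoder("ahmedfarazsyk/ms-marco-MiniLM-L6-v2-finetuned-scidocs")

query = "graph neural networks for citation recommendation"
candidates = [
    "A GNN-based approach to citation recommendation in scholarly corpora.",
    "A blog post about sourdough baking.",                 # irrelevant
    "Keyword-matching baseline for general web retrieval.",  # hard negative
]

# Score each (query, document) pair and rank by predicted relevance.
scores = model.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```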