Why is my SetFit model only outputting two possible class confidence scores?

I trained a SetFit model with the default logistic regression head on an unbalanced dataset of 5000 examples for a binary classification task. Because the dataset is unbalanced, I wanted to study the area under the precision-recall curve, which requires class confidence scores rather than “argmax’d” class labels. However, when I called model.predict_proba(ds["text"]), the results only ever came in one of two class confidence pairs: either 0.9989 & 0.0011 or 0.023045 & 0.97695.

(The actual confidence scores vary slightly around these values, e.g. 0.9988959589582787 or 0.9988959627978705, but I’m assuming that’s noise.)

I am doing the evaluation in a different script after downloading the trained model:

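import pickle

from datasets import Dataset
from setfit import SetFitModel

# Load the trained SetFit model from the local save directory
model = SetFitModel.from_pretrained("model_outputs/test_extra_setfit_save")

# Workaround: load the fitted logistic regression head manually
with open("model_outputs/test_extra_setfit_save/model_head.pkl", 'rb') as mhf:
    model.model_head = pickle.load(mhf)

ds = Dataset.load_from_disk("eval_dataset")
ds = ds.rename_columns({
    "is_hate": "label"
})

results = model.predict_proba(ds["text"])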

I’m loading the model head manually because of this bug; by default it was giving a “logistic regression not yet fitted” error.

Exploring the Mystery of Limited Confidence Scores in SetFit Models

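When your SetFit model outputs only two possible class confidence scores, it is natural to ask what causes this behavior. A deeper question is whether the phenomenon points to technical challenges or design considerations worth digging into. Let's approach it from a few key angles.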

  1. Data Imbalance: The Quiet Saboteur


Imagine a training dataset where one class accounts for most of the examples. Your model, eager to optimize, might naturally lean towards that class.


Solution:

• Perform data augmentation for underrepresented classes.

• Consider oversampling techniques to balance the dataset.
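
As a rough illustration of the oversampling idea, here is a minimal sketch in plain Python that duplicates randomly chosen minority-class examples until every class matches the largest one. The function name `oversample_minority` and the toy data are hypothetical, not part of SetFit or of your setup, and in practice you would apply this to the training split only.

```python
import random
from collections import Counter


def oversample_minority(texts, labels, seed=0):
    """Duplicate random examples of smaller classes until all classes
    have as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the texts by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = rng.choices(items, k=target - len(items))
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy data: four examples of class 0, one example of class 1.
texts = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
balanced_texts, balanced_labels = oversample_minority(texts, labels)
print(Counter(balanced_labels))
```

Exact duplicates add no new information, so this only changes how much weight the minority class gets during training.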

  2. Loss Functions and Activations: A Case of Mismatch


The choice of loss function plays a pivotal role in guiding your model’s learning behavior. A binary cross-entropy loss in a multi-class scenario could confuse the model, while an improperly configured activation function, such as Sigmoid instead of Softmax, might exacerbate the issue.

Solution:

• Verify your loss function aligns with the task (categorical cross-entropy for multi-class problems).

• Ensure the final activation layer suits the problem (Softmax for multi-class classification).
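
One thing worth keeping in mind for a binary task: a two-class Softmax and a Sigmoid applied to the difference of the two logits give the same probability, so the activation choice only starts to matter with more than two classes. A small sketch in plain Python, with made-up logit values, shows the equivalence:

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for class 0 and class 1 (illustration only).
logits = [0.3, 2.1]

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits[1] - logits[0])

# Both routes give the same probability for class 1.
print(p_softmax[1], p_sigmoid)
```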

  3. Preprocessing: The Hidden Culprit


Even the best models can falter if the data preprocessing pipeline introduces inconsistencies. Mismatched labels or differing encoding strategies between training and inference phases might confuse the model.


Solution:

• Double-check that the labels and formats in your training and testing datasets are consistent.

• Perform a detailed audit of your data pipeline to identify discrepancies.
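
A quick way to start such an audit is to compare the label values seen in training with those seen at evaluation time. The sketch below uses only the standard library. The helper name `compare_label_sets` and the toy labels are hypothetical, and the example deliberately mixes integer and string labels to show the kind of mismatch it can catch.

```python
from collections import Counter


def compare_label_sets(train_labels, eval_labels):
    """Report label values that appear in only one split, plus per-split
    counts (hypothetical helper for a quick consistency check)."""
    train_counts = Counter(train_labels)
    eval_counts = Counter(eval_labels)
    return {
        "only_in_train": sorted(set(train_counts) - set(eval_counts), key=repr),
        "only_in_eval": sorted(set(eval_counts) - set(train_counts), key=repr),
        "train_counts": dict(train_counts),
        "eval_counts": dict(eval_counts),
    }


# Toy example: training labels are ints, evaluation labels are strings.
train_labels = [0, 0, 0, 1]
eval_labels = ["0", "1", "0"]

report = compare_label_sets(train_labels, eval_labels)
print(report["only_in_train"], report["only_in_eval"])
```

If both "only in" lists come back empty, the label values at least agree between the two splits, and the per-split counts show how unbalanced each one is.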

Final Thoughts:

Sometimes, the solution to a seemingly technical issue lies in revisiting the fundamentals. Have you considered how your dataset, architecture, and configuration interact? Are there assumptions baked into your model that need reevaluation?


If you’d like to share more details about your model setup, I’d be happy to help you troubleshoot further. Let’s solve this mystery together!

