F1 is always 0 for a multi-label classification task

Hello, it’s my first time trying to fine-tune a model, and I’m having trouble getting a decent F1 score.

The source code of my fine-tuning experiment is here: mitbforalldemo/fine_tuning/bert-fine-tuning.ipynb at main · calvinli2024/mitbforalldemo · GitHub

I’m using a HF dataset that I found here: maximuspowers/philosophy-schools-multilabel · Datasets at Hugging Face

I’ve examined both the fine-tuning code and the dataset, and I can’t find any glaring issues with either, and yet my F1 score is always zero when I train.

I can get a non-zero F1 for other multi-label datasets with the same code, so I doubt the fine-tuning code is the problem. And I can’t see how anything could be wrong with a dataset as simple and well-organized as the one I linked.
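
For context, the setup follows the standard multi-label recipe, roughly like this (a simplified sketch; the model name and NUM_LABELS are placeholders, not the notebook’s exact values):

from transformers import AutoModelForSequenceClassification

NUM_LABELS = 10  # placeholder: the number of label columns in the dataset

# For multi-label tasks the model must be configured to apply a sigmoid per
# label with BCEWithLogitsLoss, rather than softmax + cross-entropy:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",
)

# Each example's "labels" must be a float multi-hot vector, e.g.
# [0.0, 1.0, 0.0, ...] with dtype float32. If the labels stay integers and
# problem_type is unset, the model silently computes single-label
# cross-entropy instead.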

Any help would be appreciated!


How about something like this? With a multi-label head, each label gets an independent sigmoid probability, and early in training those probabilities often all sit below 0.5, so a default 0.5 threshold predicts all zeros and F1 is exactly 0. Lowering the threshold usually gets you off the floor:

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1 / (1 + np.exp(-logits))        # sigmoid
    preds = (probs >= 0.30).astype(np.int32) # start at 0.30; tune later
    return {
        "f1": f1_score(labels, preds, average="weighted", zero_division=0),
    }
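
If the score is still 0 after that, sweep the threshold on the validation set instead of guessing. A minimal sketch, assuming you already have the eval logits and multi-hot labels as NumPy arrays:

import numpy as np
from sklearn.metrics import f1_score

def best_threshold(logits, labels, grid=np.linspace(0.05, 0.50, 10)):
    """Return the (threshold, weighted F1) pair that scores best on held-out data."""
    probs = 1 / (1 + np.exp(-logits))  # sigmoid per label
    scores = [
        f1_score(labels, (probs >= t).astype(int),
                 average="weighted", zero_division=0)
        for t in grid
    ]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

With very sparse labels even 0.30 can be too high, so the per-class positive rate in the training split is a useful guide for where to start the sweep.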