Why is BCELoss used for multi-label classification?

I was looking into the code for some models (e.g. transformers/src/transformers/models/bert/modeling_bert.py at de4112e4d20795b27bad0050e30f324a1a3a26f2 · huggingface/transformers · GitHub), and noticed this:

            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

I don’t get why the code looks like this. If we have a binary classification problem, I’d expect to use BCE there. Is this a bug or have I missed something?
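For concreteness, here is a minimal sketch of what I understand the two branches to expect; the shapes, logits, and labels below are made up purely for illustration:

    import torch
    from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss

    # Toy setup: a batch of 2 examples and 3 classes (values are arbitrary).
    num_labels = 3
    logits = torch.randn(2, num_labels)

    # single_label_classification: exactly one true class per example.
    # Labels are class indices, and CrossEntropyLoss applies a softmax over classes.
    single_labels = torch.tensor([0, 2])
    ce_loss = CrossEntropyLoss()(logits.view(-1, num_labels), single_labels.view(-1))

    # multi_label_classification: any number of true classes per example.
    # Labels are multi-hot float vectors, and BCEWithLogitsLoss applies an
    # independent sigmoid to every logit.
    multi_labels = torch.tensor([[1., 0., 1.],
                                 [0., 1., 0.]])
    bce_loss = BCEWithLogitsLoss()(logits, multi_labels)

    print(ce_loss.item(), bce_loss.item())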


machine learning - Should I use a categorical cross-entropy or binary cross-entropy loss for binary predictions? - Cross Validated (stackexchange.com)

Yes, exactly, which just adds to my point: why is BCE used when we have 2+ labels, i.e. "multi_label_classification"?

“For multi-label classification, the idea is the same. But instead of say 3 labels to indicate 3 classes, we have 6 labels to indicate presence or absence of each class (class1=1, class1=0, class2=1, class2=0, class3=1, and class3=0). The loss then is the sum of cross-entropy loss for each of these 6 classes.”

According to this comment, it’s still binary within each class.
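A quick numeric way to see the "binary within each class" point, with made-up logits and labels just for illustration:

    import torch
    import torch.nn.functional as F

    # Toy example: one sample, three classes, multi-hot target.
    logits = torch.tensor([[1.2, -0.7, 0.3]])
    labels = torch.tensor([[1., 0., 1.]])

    # BCEWithLogitsLoss over all classes at once...
    combined = F.binary_cross_entropy_with_logits(logits, labels)

    # ...is just the average of an independent binary cross-entropy per class.
    per_class = torch.stack([
        F.binary_cross_entropy_with_logits(logits[:, i], labels[:, i])
        for i in range(logits.shape[1])
    ]).mean()

    print(torch.allclose(combined, per_class))  # True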


Thanks for your answer. I appreciate you looking into this.

So — this is what I understand, feel free to correct my thinking:

My question is about why BCELoss is used for problems of type "multi_label_classification" (my assumption is that there may be 1+ true classes for a given data point).

A multi-label classifier would output O = (c_1, c_2, \dots, c_n), with each c_i \in [0,1].

My intuition would be to apply CELoss as:

loss = - \sum_{i=1}^{n} y_i * \log{c_i}

I really can’t see why this wouldn’t work. It’s the standard cross-entropy formula.
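To be concrete, this is roughly what I have in mind; c stands for the per-class probabilities the model outputs and y for the multi-hot target, both invented here:

    import torch

    # Rough sketch of the loss I mean: loss = - sum_i y_i * log(c_i)
    def multi_label_ce(c, y, eps=1e-8):
        return -(y * torch.log(c + eps)).sum(dim=-1).mean()

    c = torch.tensor([[0.9, 0.2, 0.8]])   # made-up per-class probabilities
    y = torch.tensor([[1., 0., 1.]])      # two true classes for this sample
    print(multi_label_ce(c, y))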

Despite that, the implementation in HF Transformers uses a binary cross-entropy loss, which is a variation of the above for the 2-label case.

Then you quote an answer on Cross Validated that says you can reformulate a binary classification problem, e.g.:

Is Schrödinger’s cat dead or alive?

BCEloss = -[reality * \log{p(dead)} + (1 - reality) * \log{p(alive)}]

As:

CEloss = -[reality_{dead} * \log{p(dead)} + (1 - reality_{dead}) * \log{(1 - p(dead))} + reality_{alive} * \log{p(alive)} + (1 - reality_{alive}) * \log{(1 - p(alive))}]
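Just to check that I’m reading the reformulation correctly, here is a quick numeric sanity check; p(dead) and the reality values are made-up placeholders:

    import math

    p_dead = 0.7          # model's predicted probability that the cat is dead
    p_alive = 1 - p_dead
    reality_dead = 1.0    # ground truth: the cat is dead
    reality_alive = 1 - reality_dead

    # Two-term form: binary cross-entropy over the single "dead" label.
    bce = -(reality_dead * math.log(p_dead)
            + (1 - reality_dead) * math.log(p_alive))

    # Four-term form: one binary cross-entropy per label ("dead" and "alive").
    ce = -(reality_dead * math.log(p_dead)
           + (1 - reality_dead) * math.log(1 - p_dead)
           + reality_alive * math.log(p_alive)
           + (1 - reality_alive) * math.log(1 - p_alive))

    print(bce, ce)  # here the second is twice the first, since the alive terms mirror the dead ones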

OK, all good with this. I absolutely agree that you can do this, and that it works for multi-label classification when expanded to 2+ classes. But my question, the one in the first message, is still: why can’t you use the usual CE loss?

I feel that you’re quoting that answer on Cross Validated as if it were obvious, but even if it is obvious to you, it may not be obvious to everyone. I would really appreciate a bit more explanation, rather than a copy-paste that assumes I can somehow read other people’s minds :cry: