Why is BCELoss used for multi-label classification?

I was looking into the code for some models (e.g. transformers/src/transformers/models/bert/modeling_bert.py at de4112e4d20795b27bad0050e30f324a1a3a26f2 · huggingface/transformers · GitHub), and noticed this:

            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

I don’t get why the code looks like this. If we have a binary classification problem, I’d expect to use BCE there. Is this a bug or have I missed something?
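For concreteness, here is a minimal sketch of what I understand the two branches to expect; the shapes, logits, and labels below are made up purely for illustration:

    import torch
    from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss

    # Toy setup: a batch of 2 examples and 3 classes (values are arbitrary).
    num_labels = 3
    logits = torch.randn(2, num_labels)

    # single_label_classification: exactly one true class per example.
    # Labels are class indices, and CrossEntropyLoss applies a softmax over classes.
    single_labels = torch.tensor([0, 2])
    ce_loss = CrossEntropyLoss()(logits.view(-1, num_labels), single_labels.view(-1))

    # multi_label_classification: any number of true classes per example.
    # Labels are multi-hot float vectors, and BCEWithLogitsLoss applies an
    # independent sigmoid to every logit.
    multi_labels = torch.tensor([[1., 0., 1.],
                                 [0., 1., 0.]])
    bce_loss = BCEWithLogitsLoss()(logits, multi_labels)

    print(ce_loss.item(), bce_loss.item())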


machine learning - Should I use a categorical cross-entropy or binary cross-entropy loss for binary predictions? - Cross Validated (stackexchange.com)

Yes, exactly, which just adds to my point: why is BCE used when we have 2+ labels, i.e. "multi_label_classification"?

“For multi-label classification, the idea is the same. But instead of say 3 labels to indicate 3 classes, we have 6 labels to indicate presence or absence of each class (class1=1, class1=0, class2=1, class2=0, class3=1, and class3=0). The loss then is the sum of cross-entropy loss for each of these 6 classes.”

According to this comment, it’s still binary within each class.
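A quick numeric way to see the "binary within each class" point, with made-up logits and labels just for illustration:

    import torch
    import torch.nn.functional as F

    # Toy example: one sample, three classes, multi-hot target.
    logits = torch.tensor([[1.2, -0.7, 0.3]])
    labels = torch.tensor([[1., 0., 1.]])

    # BCEWithLogitsLoss over all classes at once...
    combined = F.binary_cross_entropy_with_logits(logits, labels)

    # ...is just the average of an independent binary cross-entropy per class.
    per_class = torch.stack([
        F.binary_cross_entropy_with_logits(logits[:, i], labels[:, i])
        for i in range(logits.shape[1])
    ]).mean()

    print(torch.allclose(combined, per_class))  # True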


Thanks for your answer. I appreciate you looking into this.

So — this is what I understand, feel free to correct my thinking:

My question is about why BCELoss is used for problems of type "multi_label_classification" (my assumption is that there may be 1+ true classes for a given data point).

A multi-label classifier would output O = (c_1, c_2, \dots, c_n), with each c_i \in [0,1].

My intuition would be to apply CELoss as:

loss = - \sum_{i=1}^{n} y_i * \log{c_i}

I really can’t see why this wouldn’t work. It’s the standard cross-entropy formula.
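To be concrete, this is roughly what I have in mind; c stands for the per-class probabilities the model outputs and y for the multi-hot target, both invented here:

    import torch

    # Rough sketch of the loss I mean: loss = - sum_i y_i * log(c_i)
    def multi_label_ce(c, y, eps=1e-8):
        return -(y * torch.log(c + eps)).sum(dim=-1).mean()

    c = torch.tensor([[0.9, 0.2, 0.8]])   # made-up per-class probabilities
    y = torch.tensor([[1., 0., 1.]])      # two true classes for this sample
    print(multi_label_ce(c, y))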

Despite that, the implementation in HF Transformers uses a binary cross-entropy loss, which is a variation of the above for the 2-label case.

Then you quote an answer on Cross Validated that says you can reformulate a binary classification problem, e.g.:

Is Schrödinger’s cat dead or alive?

BCEloss = -[reality * \log{p(dead)} + (1 - reality) * \log{p(alive)}]

As:

CEloss = -[reality_{dead} * \log{p(dead)} + (1 - reality_{dead}) * \log{(1 - p(dead))} + reality_{alive} * \log{p(alive)} + (1 - reality_{alive}) * \log{(1 - p(alive))}]
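Just to check that I’m reading the reformulation correctly, here is a quick numeric sanity check; p(dead) and the reality values are made-up placeholders:

    import math

    p_dead = 0.7          # model's predicted probability that the cat is dead
    p_alive = 1 - p_dead
    reality_dead = 1.0    # ground truth: the cat is dead
    reality_alive = 1 - reality_dead

    # Two-term form: binary cross-entropy over the single "dead" label.
    bce = -(reality_dead * math.log(p_dead)
            + (1 - reality_dead) * math.log(p_alive))

    # Four-term form: one binary cross-entropy per label ("dead" and "alive").
    ce = -(reality_dead * math.log(p_dead)
           + (1 - reality_dead) * math.log(1 - p_dead)
           + reality_alive * math.log(p_alive)
           + (1 - reality_alive) * math.log(1 - p_alive))

    print(bce, ce)  # here the second is twice the first, since the alive terms mirror the dead ones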

OK, all good with this. I absolutely agree that you can do this, and that it works for multi-label classification when expanded to 2+ classes. But my question, the one in the first message, is still: why can’t you use the usual CE loss?

I feel that you’re quoting that answer on Cross Validated as if it were obvious, but even if it is obvious to you, it may not be obvious to everyone. I would really appreciate a bit more explanation, rather than a copy-paste that assumes I can somehow read other people’s minds :cry: