Why is BCELoss used for multi-label classification?

Thanks for your answer. I appreciate you looking into this.

So, this is what I understand; feel free to correct my thinking:

My question is why BCELoss is used for multi-label classification problems (my assumption is that a given data point may have one or more true classes).

A multi-label classifier would output O=(c_1,c_2,\dots,c_n), with c_i \in [0,1].

My intuition would be to apply CELoss as:

loss = - \sum_{i=1}^{n} y_i * \log{c_i}

I really can’t see why this wouldn’t work; it’s the standard cross-entropy formula.
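In code, the loss I have in mind would look something like this (a rough sketch; the shapes, the sigmoid activation, and the batch averaging are just my assumptions):

```python
import torch

# Hypothetical setup: batch of 2 data points, n = 4 classes,
# multi-hot targets (one or more true classes per data point).
torch.manual_seed(0)
logits = torch.randn(2, 4)
y = torch.tensor([[1., 0., 1., 0.],
                  [0., 1., 0., 0.]])

# Per-class scores c_i in [0, 1], e.g. from a sigmoid.
c = torch.sigmoid(logits)

# The formula above: loss = -sum_i y_i * log(c_i), averaged over the batch.
loss = -(y * torch.log(c)).sum(dim=1).mean()
print(loss)
```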

Despite this, the implementation in HF Transformers uses a binary cross-entropy loss, which is the variant of the above for the 2-label case.
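If I read the multi-label branch correctly, it boils down to something like `BCEWithLogitsLoss` on the raw logits against multi-hot targets. Here is a minimal sketch of that usage (my reading, not the actual Transformers code; the shapes are made up):

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(2, 4)                 # raw scores, no activation yet
y = torch.tensor([[1., 0., 1., 0.],
                  [0., 1., 0., 0.]])

# BCEWithLogitsLoss fuses the sigmoid with binary cross-entropy and
# treats every class as an independent yes/no decision.
loss = nn.BCEWithLogitsLoss()(logits, y)
print(loss)
```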

Then you quote an answer on Cross Validated which says that you can reformulate a binary classification problem, e.g.:

Is Schrödinger’s cat dead or alive?

BCEloss = -(reality * \log{p(dead)} + (1 - reality) * \log{p(alive)})

As:

CEloss = -(reality_{dead} * \log{p(dead)} + (1 - reality_{dead}) * \log{(1 - p(dead))} + reality_{alive} * \log{p(alive)} + (1 - reality_{alive}) * \log{(1 - p(alive))})
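To convince myself that this expansion really is just BCE applied per class, here is a quick numerical check (my own sketch; the Schrödinger’s-cat variable names are made up):

```python
import torch
from torch import nn

p_dead = torch.tensor(0.8)   # hypothetical model output: P(cat is dead)
r_dead = torch.tensor(1.0)   # ground truth ("reality"): the cat is dead
p_alive, r_alive = 1 - p_dead, 1 - r_dead

# The expansion above, term by term (with the conventional minus sign):
expanded = -(r_dead * torch.log(p_dead)
             + (1 - r_dead) * torch.log(1 - p_dead)
             + r_alive * torch.log(p_alive)
             + (1 - r_alive) * torch.log(1 - p_alive))

# The same thing as BCE applied independently to each "class":
per_class = nn.BCELoss(reduction="sum")(torch.stack([p_dead, p_alive]),
                                        torch.stack([r_dead, r_alive]))

print(expanded, per_class)   # identical values
```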

OK, all good with this. I absolutely agree that you can do this, and that it works for multi-label classification once expanded to more than 2 classes (see the sketch below). But my question, the one in the first message, still stands: why can’t you use the usual CE loss?
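Just to spell out the n-class expansion I’m agreeing with (again my own sketch, with assumed shapes and batch averaging):

```python
import torch
from torch import nn

torch.manual_seed(0)
c = torch.sigmoid(torch.randn(2, 4))       # per-class probabilities
y = torch.tensor([[1., 0., 1., 0.],
                  [0., 1., 0., 0.]])

# Per-class expansion, summed over classes and averaged over the batch:
manual = -(y * torch.log(c) + (1 - y) * torch.log(1 - c)).sum(dim=1).mean()

# nn.BCELoss with reduction="none" gives exactly the per-class terms:
builtin = nn.BCELoss(reduction="none")(c, y).sum(dim=1).mean()

print(manual, builtin)                     # identical values
```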

I feel that you’re quoting that Cross Validated answer as if it were obvious, but even if it’s obvious to you, it may not be obvious to everyone. I would really appreciate a bit more explanation, rather than a copy-paste that assumes I can read other people’s minds :cry: