Thanks for your answer. I appreciate you looking into this.
So — this is what I understand, feel free to correct my thinking:
My question is about why BCELoss is used for multi-label classification problems
(my assumption being that there may be more than one true class for a given data point).
A multi-label classifier would output O=(c_1,c_2,\dots,c_n), with each c_i \in [0,1].
My intuition would be to apply CELoss as:
loss = - \sum_{i=1}^{n} y_i * \log{c_i}
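To make the proposal concrete, here is a minimal sketch of that sum (the tensor names, shapes, and values are mine, purely illustrative):

```python
import torch

# Hypothetical multi-hot targets and model outputs c_i in [0, 1]
# for a single data point with n = 4 classes (values made up).
y = torch.tensor([1.0, 0.0, 1.0, 0.0])  # more than one true class allowed
c = torch.tensor([0.9, 0.2, 0.8, 0.1])  # per-class scores in [0, 1]

# The proposed loss: - sum_i y_i * log(c_i)
loss = -(y * torch.log(c)).sum()
```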
I really can’t see why this wouldn’t work. It’s the most standard formula ever.
Despite that, the implementation in HF Transformers uses a binary cross-entropy loss, which is a variation of the above for the 2-label case.
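For reference, the multi-label branch there boils down to a per-label binary cross-entropy over raw logits, something along these lines (a sketch with made-up values, not the actual library code):

```python
import torch

# Per-label binary cross-entropy over logits, as used for
# multi-label classification (logits and labels are made up).
logits = torch.tensor([[2.0, -1.0, 0.5]])  # one example, 3 labels
labels = torch.tensor([[1.0, 0.0, 1.0]])   # multi-hot targets

loss_fct = torch.nn.BCEWithLogitsLoss()
loss = loss_fct(logits, labels)  # mean of the 3 per-label BCE terms
```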
Then, you quote an answer in Cross Validated that says that you can reformulate a binary classification problem (e.g.):
Is Schrodinger’s cat dead or alive?
BCEloss = -[reality * \log{p(dead)} + (1 - reality) * \log{p(alive)}]
As:
CEloss = -[reality_{dead} * \log{p(dead)} + (1 - reality_{dead}) * \log{(1 - p(dead))} + reality_{alive} * \log{p(alive)} + (1 - reality_{alive}) * \log{(1 - p(alive))}]
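Written out, that expansion is just a per-class BCE summed over both outcomes. A sketch (the "reality" targets are one-hot here, the probabilities are made up, and I keep the usual leading minus of a loss):

```python
import math

# One-hot "reality" for the two outcomes and a model's probabilities.
reality_dead, reality_alive = 1.0, 0.0  # say the cat turned out dead
p_dead = 0.7
p_alive = 1.0 - p_dead                  # the two probabilities sum to 1

# Per-class BCE, summed over both classes (the CEloss expansion above).
loss = -(reality_dead * math.log(p_dead)
         + (1 - reality_dead) * math.log(1 - p_dead)
         + reality_alive * math.log(p_alive)
         + (1 - reality_alive) * math.log(1 - p_alive))
```

Note that with p_alive = 1 - p_dead, the dead and alive terms collapse into the same two-term BCE expression, just counted once per class.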
Ok. All good with this. I absolutely agree that you can do this and that it works for multi-label classification, also when expanded to more than 2 classes. But my question, the one in my first message, still stands: why can't you use the usual CE loss?
I feel that you’re quoting that Cross Validated answer as if it were obvious, but even if it is obvious to you, it may not be obvious to everyone. I would really appreciate a bit more explanation, rather than a copy-paste that assumes I can read the mind of whoever wrote it.